STRUCTURE AND MOTION FROM DEPTH AND CORRESPONDENCE MODELS

By Shengjie Zhu

A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science—Doctor of Philosophy

2025

ABSTRACT

Recovering structure and motion from videos is a well-studied, comprehensive 3D vision task that involves (1) image calibration, (2) two-view pose initialization, and (3) multi-view Structure-from-Motion (SfM). Prior art consists of optimization-based methods built over sparse image correspondence inputs. This thesis develops systematic approaches to enhance classic solutions with deep learning models. We introduce EdgeDepth and PMatch for dense monocular depthmap and dense binocular correspondence map estimation. Since classic approaches typically rely on sparse and accurate inputs, they are less suitable for the dense yet high-variance predictions from dense depth and correspondence models. As a solution, we propose to optimize through the robust inlier-counting-based scoring function that is widely applied in RANdom SAmple Consensus (RANSAC). Our system is structured as follows: (1) For image calibration, we introduce WildCamera. The system applies a RANSAC algorithm to a dense incidence field regressed by a deep model. It calibrates in-the-wild monocular images without a checkerboard. (2) For two-view pose estimation, we introduce LightedDepth. It estimates the optimal pose by aligning the depth map with the correspondence map, maximizing the projective inliers. (3) The strategy is extended to a Hough Transform in RSfM for multi-view SfM over a local 3- to 9-frame system. (4) We generalize the RSfM discrete inlier-counting scoring function to a smoothed scoring function via marginalizing thresholds for the general SfM task. To this end, we formulate a comprehensive system that recovers structure and motion from two-view, local multi-view, and large-scale multi-view images with dense monocular depthmaps and binocular correspondence maps. Compared to prior art, our methods show comprehensive improvements on two-view, small-scale, and large-scale multi-view systems.

Copyright by SHENGJIE ZHU 2025

This thesis is dedicated to my wife Lisheng, whose unwavering support and encouragement have been my greatest source of strength throughout this journey.

ACKNOWLEDGEMENTS

First and foremost, I would like to express my deepest gratitude to my advisor, Prof. Xiaoming Liu, for his unwavering support, guidance, and mentorship throughout my Ph.D. journey. His high expectations and relentless pursuit of excellence have pushed me beyond my limits, enabling me to grow as a researcher and thinker. His willingness to engage in deep technical discussions, provide constructive feedback, and challenge me to think critically has been invaluable in shaping my research skills. Through his mentorship, I have learned the importance of perseverance, curiosity, and precision in scientific inquiry. Beyond academics, his encouragement and belief in my potential have been a constant source of motivation. I am truly honored to have had the opportunity to learn from him, and his mentorship will continue to inspire me in my future endeavors. I would also like to thank Prof. Anil Jain, Prof. Daniel Morris, and Prof. Vishnu Boddeti for serving on my Ph.D. guidance committee and for their valuable guidance and feedback. I would like to thank Dr. Ahmed Abdelkader, Dr. Vincent Chu, and Mark Matthews for their mentorship and support during my internship at Google.
Your support was instrumental in shaping the final chapter of this thesis. I would also like to thank Dr. Ning Zhou, Dr. Haotian Xu, Dr. Jingyi Zhang, and Dr. Rui Hou for their mentorship during my internship at Amazon. It has been a truly inspiring experience, broadening my perspective on research and innovation. The insights I gained from the team have been a lasting source of motivation, driving my later research. Further, I am deeply grateful to Prof. Fernando for providing me with the opportunity to present at CMU. CVLab is a loving place. I would like to thank all my labmates, Dr. Xi Yin, Dr. Amin Jourabloo, Dr. Garrick Brazil, Dr. Luan Tran, Dr. Feng Liu, Dr. Yaojie Liu, Dr. Andrew Hou, Dr. Abhinav Kumar, Dr. Vishal Asnani, Xiao Guo, Minchul Kim, Joel Stehouwer, Bangjie Yin, Hieu Nguyen, Masa Hu, Ziyuan Zhang, Girish Chandar Ganesan, Yiyang Su, Jie Zhu, and Zhiyuan Ren, for making my Ph.D. journey both productive and enjoyable. Finally, I would like to extend my heartfelt gratitude to my wife and my parents for their unconditional support. No words can fully capture the depth of my love for you.

TABLE OF CONTENTS

CHAPTER 1  INTRODUCTION
  1.1 Contributions of the Thesis
CHAPTER 2  THE EDGE OF DEPTH: EXPLICIT CONSTRAINTS BETWEEN SEGMENTATION AND DEPTH
  2.1 Introduction
  2.2 Related work
  2.3 The Proposed Method
  2.4 Experiments
  2.5 Conclusions
CHAPTER 3  PMATCH: PAIRED MASKED IMAGE MODELING FOR DENSE GEOMETRIC MATCHING
  3.1 Introduction
  3.2 Related works
  3.3 Method
  3.4 Experiments
  3.5 Ablation Study
  3.6 Conclusion
CHAPTER 4  TAME A WILD CAMERA: IN-THE-WILD MONOCULAR CAMERA CALIBRATION
  4.1 Introduction
  4.2 Related Works
  4.3 Method
  4.4 Experiments
  4.5 Conclusion
CHAPTER 5  LIGHTEDDEPTH: VIDEO DEPTH ESTIMATION IN LIGHT OF LIMITED INFERENCE VIEW ANGLES
  5.1 Introduction
  5.2 Prior Works
  5.3 Proposed Method
  5.4 Experiments
  5.5 Conclusions
CHAPTER 6  RSFM: REVISIT SELF-SUPERVISED DEPTH ESTIMATION WITH LOCAL STRUCTURE-FROM-MOTION
  6.1 Introduction
  6.2 Related Works
  6.3 Methodology
  6.4 Experiments
  6.5 Conclusion
CHAPTER 7  MOTION-FROM-STRUCTURE: LEVERAGING MONOCULAR DEPTH PRIORS FOR MULTI-VIEW TASKS
  7.1 Introduction
  7.2 Related Work
  7.3 Method
  7.4 Experiments
  7.5 Discussion
  7.6 Conclusion
CHAPTER 8  CONCLUSIONS AND FUTURE WORK
  8.1 Conclusions
  8.2 Future Work Suggestions
BIBLIOGRAPHY
APPENDIX

CHAPTER 1

INTRODUCTION

Estimating structure and motion from 2D image sets is a fundamental task with diverse applications in 3D reconstruction [31], robotics [132], and autonomous driving [310]. This task extracts 3D point clouds, camera extrinsics, and camera intrinsics from RGB images, requiring a comprehensive vision system that includes camera calibration, two-view pose estimation, and multi-view Structure-from-Motion (SfM).

Classic Structure-from-Motion (SfM) methods [209, 287, 85] rely on sparse image correspondence inputs extracted using feature detectors such as SIFT [150], SURF [15], and learned descriptors like SuperPoint [63, 204]. These methods construct a sparse yet highly accurate 3D point cloud by triangulating matched keypoints across multiple views. Given a well-initialized system, a robust Bundle Adjustment (BA) algorithm jointly optimizes 3D point positions, camera intrinsics, and camera poses by minimizing reprojection and photometric errors. However, classic SfM approaches typically assume that the input image collection exhibits well-textured regions, sufficient parallax between views, and a high degree of visual overlap—conditions that may not always hold in real-world scenarios.

Recent advancements in deep learning have enabled the development of monocular depth estimators [19] that generate dense depth maps from single RGB images without requiring camera motion. The rise of transformer-based foundation models [259] has further accelerated research efforts toward creating large-scale, highly generalizable monocular depth estimation models.
These models [295] are trained using a combination of large-scale labeled datasets and unlabeled image collections, enhancing their robustness and adaptability. Notably, monocular depth models output depth maps or point clouds in metric space, in contrast to traditional SfM systems, which produce up-to-scale point clouds. Despite its growing capabilities, monocular depth estimation has limitations. A major drawback is that the generated point clouds are significantly noisier than those produced by SfM. Additionally, accurately quantifying the noise level in these depth predictions remains an open research question. Since classic SfM algorithms rely on sparse and highly accurate points, the high variance in monocular depth maps makes them less suitable for direct integration into traditional SfM pipelines.

A similar trend is observed in image correspondence estimation models [331]. Traditionally, image correspondence relied on handcrafted feature descriptors such as SIFT [151] and ORB [175]. In contrast, learning-based methods utilize labeled data to automatically learn image matching features during training. These models also benefit from more powerful computational resources, such as GPUs. As a result, learning-based approaches have achieved significantly higher accuracy than their handcrafted counterparts. Recently, several studies have extended sparse image correspondence estimation to dense correspondence estimation [69]. The transition to a dense output format allows learning-based models to incorporate additional global priors. Experimental results in two-view pose estimation have shown that dense correspondence can improve pose estimation accuracy. However, similar to monocular depth models, dense correspondence estimators face challenges in integrating with classic SfM methods, as these systems are designed to operate on sparse point clouds.

Camera pose estimation has become increasingly important with the growing number of applications that rely on precise spatial localization. For instance, autonomous vehicles, drones, and other robotic systems depend on accurate pose estimation for navigation. Additionally, emerging 3D image generation methods [176], which synthesize coherent 3D models from multi-view images, require well-registered input images. Neural rendering techniques [170], with significant potential in AR/VR applications, also assume multi-view images with known camera poses.

Pioneering work has explored the integration of deep learning with camera pose estimation. One line of research focuses on absolute camera pose regression, where a deep neural network takes a single image or an image pair as input and directly regresses the absolute camera pose in world coordinates [24] or the relative pose between the two images [216]. Another approach is scene coordinate regression, where the model predicts a 3D point cloud either in a global world coordinate system [21] or relative to the input images [273]. However, there is still insufficient evidence that these learning-based methods consistently outperform traditional geometric approaches.

In this thesis, we propose a novel framework that effectively combines deep learning with camera pose estimation, leveraging the strengths of both paradigms. Our approach utilizes deep networks for dense, pixel-wise predictions guided by spatial geometric priors, i.e., the dense depthmaps and correspondence maps. Then, we employ a post-optimization scheme to refine the low-degree-of-freedom (DoF) camera poses based on the dense yet noisy predictions.
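The recurring primitive throughout this thesis is to score a low-DoF hypothesis by counting how many dense, noisy predictions it explains. As a concrete illustration only (the function name, the fixed pixel threshold, and the shared-intrinsics assumption below are ours, not the exact formulation used in later chapters), a candidate two-view pose can be scored by back-projecting the monocular depth map, reprojecting into the second view, and counting agreements with the dense correspondence map:

```python
import numpy as np

def count_projective_inliers(depth, corr, K, R, t, tau=2.0):
    """Score a candidate pose (R, t) by counting projective inliers.

    depth: (H, W) monocular depth map of the source view.
    corr:  (H, W, 2) dense correspondence map giving, for every source
           pixel, its matched (x, y) location in the target view.
    K:     (3, 3) camera intrinsics, assumed shared by both views.
    """
    H, W = depth.shape
    xs, ys = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([xs, ys, np.ones_like(xs)], -1).reshape(-1, 3).astype(np.float64)

    # Back-project source pixels to 3D with the (noisy) monocular depth.
    rays = (np.linalg.inv(K) @ pix.T).T
    pts3d = rays * depth.reshape(-1, 1)

    # Transform into the target frame and project back to pixels.
    pts3d_t = (R @ pts3d.T).T + t.reshape(1, 3)
    proj = (K @ pts3d_t.T).T
    proj = proj[:, :2] / np.clip(proj[:, 2:3], 1e-6, None)

    # A pixel is an inlier if its reprojection agrees with the dense
    # correspondence prediction within tau pixels.
    err = np.linalg.norm(proj - corr.reshape(-1, 2), axis=1)
    valid = depth.reshape(-1) > 0
    return int(np.count_nonzero((err < tau) & valid))
```

The pose (or intrinsics, or translation scale) that maximizes such an inlier count is preferred, which is robust to the heavy-tailed noise of dense predictions in a way that least-squares objectives are not.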
In this dissertation, we present a comprehensive system for estimating multi-view camera intrinsics and extrinsics by leveraging network outputs, specifically dense image correspondence maps and depth maps. Our approach begins with the dense depth estimator EdgeDepth [327] and the dense image correspondence estimator PMatch [331]. We then introduce a two-view pose initialization method, LightedDepth [330], followed by RSfM [332], a multi-view pose estimation algorithm designed to refine the two-view initialized results within a small multi-view system. Finally, we present MfS, an extension of RSfM that enhances performance across both small-scale and large-scale multi-view pose estimation scenarios.

We start with the inputs to our system, i.e., the Monocular Depth Estimator and Binocular Correspondence Estimator. In Chapter 2, we present a monocular depth estimator, EdgeDepth [327]. EdgeDepth explores the mutual benefits between self-supervised monocular depth estimation and semantic segmentation, two fundamental tasks in computer vision. Unlike previous methods that implicitly model their relationship, we introduce an explicit border consistency constraint, ensuring alignment between segmentation and depth edges. We leverage a novel morphing algorithm to iteratively refine depth predictions, making them more consistent with segmentation boundaries. Additionally, we identify and mitigate bleeding artifacts commonly found in stereo-based self-supervised depth estimation using a stereo occlusion masking technique, further enhancing depth quality near object edges. Our approach achieves state-of-the-art performance on self-supervised monocular depth estimation, for the first time matching supervised methods in absolute relative error on the KITTI dataset.

In Chapter 3, we introduce PMatch [331], a novel Paired Masked Image Modeling (pMIM) framework designed for dense geometric matching. Traditional monocular pretraining tasks, such as image classification and masked image modeling (MIM), fail to optimize the cross-frame matching module, limiting their effectiveness in geometric correspondence estimation. To overcome this, we reformulate MIM from reconstructing a single masked image to reconstructing a pair of masked images, enabling more effective pretraining of the transformer-based matching module. Additionally, we propose a cross-frame global matching module (CFGM) that enhances robustness in textureless regions by incorporating positional embeddings and a homography loss, which regularizes correspondences on planar surfaces. Through these innovations, PMatch achieves state-of-the-art performance in dense geometric matching, outperforming both sparse and dense methods on diverse benchmark datasets.

Given the input depth maps and correspondence maps, we outline our pose estimation system, beginning with a monocular intrinsic calibration method. This is followed by a two-view pose initialization approach. Finally, we introduce RSfM for small-scale multi-view pose estimation and MfS for large-scale multi-view pose estimation. In Chapter 4, we introduce WildCamera [329], a 4 Degree-of-Freedom (DoF) camera calibration method tailored for in-the-wild images.
Our approach is motivated by the intrinsic relationship between monocular depth maps and surface normal maps, where the optimal intrinsic parameters should align the depth map consistently with the normal map. However, traditional depth-normal-based calibration methods suffer from numerical instability due to their dependence on accurate depth gradients. To address this, we propose an alternative representation—the incidence field, a novel 3D monocular prior that models the incidence rays between observed 3D points and their corresponding 2D projections on the imaging plane. Unlike conventional depth and normal maps, the incidence field remains invariant to image cropping and resizing, enhancing its generalization to in-the-wild images. We develop a deep neural network to estimate the incidence field and introduce a non-learning RANSAC-based optimization algorithm to recover intrinsic parameters from the estimated field. Our method achieves state-of-the-art performance on synthetic and real-world datasets, offering a robust solution for monocular camera calibration and enabling diverse downstream applications, including image manipulation detection, uncalibrated two-view pose estimation, and improved 3D sensing.

In Chapter 5, we present LightedDepth [330], a novel two-view SfM algorithm centered around a two-view metric space pose initialization approach. Given two input images, we extract dense monocular depth maps and image correspondences. Our method proceeds in three key stages: (1) We estimate a normalized up-to-scale camera pose from the correspondences. (2) We determine the metric space translation scale using a majority-voting algorithm, which incorporates a robust, non-differentiable inlier-counting-based scoring function to enhance reliability. This strategy effectively accommodates depth map noise by leveraging its density. (3) Finally, we complete two-view SfM by estimating the two-view structure as video depth, formulated as a logged residual regression over the monocular depth input. Through this decomposition, LightedDepth achieves superior performance in video depth estimation, demonstrating robustness in scenarios with limited inference view angles while maintaining computational efficiency.

This thesis also explores advancements in self-supervised depth estimation by integrating local Structure-from-Motion (SfM). Traditional self-supervised depth estimation relies on photometric loss across immediate neighboring frames, often neglecting geometric consistency. To bridge this gap, we propose a local SfM approach with a novel Bundle-RANSAC-Adjustment algorithm that optimizes camera poses and depth adjustments across multiple frames. Experimental results demonstrate that with only a few frames, our method significantly improves depth accuracy and consistency, outperforming state-of-the-art supervised models. In sparse-view pose estimation, our approach achieves certified global optimality and surpasses existing methods in both rotational and translational accuracy. Additionally, it enhances correspondence estimation, confirming its robustness and applicability. These results establish that self-supervision within limited frames not only benefits supervised models but also sets new standards in pose and depth estimation, advancing applications in AR/VR, autonomous driving, and 3D reconstruction.

In Chapter 6, we propose RSfM [332], which extends LightedDepth [330]'s majority voting from two-view SfM to a local multi-view SfM with 3 to 9 frames.
To address the non-differentiable inlier-counting scoring function, we introduce a Hough Transform to convert it to a differentiable manifold space. However, this transformation assumes all frames are mutually visible, limiting its scalability. Despite this, RSfM shows improved pose accuracy over classic SfM by utilizing 3D priors within dense monocular depth maps, whereas classic methods [209] rely on triangulation, which is less effective with limited camera views (3 to 5 frames). Experiments demonstrate that self-supervision with only 5 frames already enhances the performance of state-of-the-art supervised models across datasets like ScanNet and KITTI360, achieving improvements in pose accuracy, depth consistency, and correspondence estimation.

In Chapter 7, we present Motion-from-Structure (MfS). We generalize the inlier counting strategy adopted in RSfM [332] to large-scale SfM systems. This method leverages the dense structural information from monocular depth priors to directly estimate camera motion without the need for per-pixel depth adjustments or model fine-tuning. Central to MfS is a reformulated bundle adjustment framework that distinguishes inliers and outliers through a robust scoring function. Unlike traditional methods that rely on a single inlier threshold, MfS generalizes this by computing an Area-Under-Curve (AUC) over multiple thresholds, effectively modeling the residual distribution as a continuous cumulative distribution function (CDF). This approach not only mitigates sensitivity to hyper-parameters but also offers a smooth and differentiable optimization landscape. Experiments on diverse datasets, including the sparse-set ETH3D and the large-scale dense-set ScanNet, demonstrate MfS's ability to achieve state-of-the-art performance in multi-view pose estimation and camera re-localization. Notably, MfS consistently outperforms classical methods by robustly handling noisy depth maps, achieving high accuracy even in challenging scenarios with limited texture or motion parallax. Furthermore, the method's scalable and plug-and-play design allows it to integrate seamlessly with arbitrary monocular depth estimation models, promoting efficient large-scale SfM without compromising accuracy.

1.1 Contributions of the Thesis

This thesis presents significant advancements in the field of Structure-from-Motion (SfM) and camera pose estimation, addressing challenges related to dense depth and correspondence estimation, camera calibration, and multi-view pose estimation. The primary contributions are:

1. Chapter 2: The Edge of Depth: Explicit Constraints between Segmentation and Depth
• Introduced EdgeDepth, a novel self-supervised monocular depth estimation framework that explicitly enforces border consistency between depth and semantic segmentation maps. This approach improves depth accuracy near object boundaries by employing a morphing algorithm and stereo occlusion masking to mitigate common artifacts.
• Achieved state-of-the-art performance on self-supervised depth estimation benchmarks, matching supervised methods on the KITTI dataset in terms of absolute relative error.

2. Chapter 3: PMatch: Paired Masked Image Modeling for Dense Geometric Matching
• Developed PMatch, a transformer-based framework utilizing Paired Masked Image Modeling (pMIM) for robust dense geometric matching. This method enhances correspondence estimation in textureless regions using a cross-frame global matching module and homography loss.
• Demonstrated superior performance over existing sparse and dense matching methods across diverse benchmark datasets.

3. Chapter 4: Tame a Wild Camera: In-the-Wild Monocular Camera Calibration
• Proposed WildCamera, a 4 DoF camera calibration technique leveraging the novel concept of an incidence field. This approach ensures robustness to image cropping and resizing, enhancing its generalization to in-the-wild datasets.
• Designed a deep learning model to estimate incidence fields and integrated a RANSAC-based optimization method for reliable intrinsic parameter recovery.
• Achieved state-of-the-art calibration performance on both synthetic and real-world datasets, enabling diverse downstream applications.

4. Chapter 5: LightedDepth: Video Depth Estimation in Light of Limited Inference View Angles
• Presented LightedDepth, a two-view SfM algorithm that accurately estimates metric space poses by integrating dense depth and correspondence inputs. This method introduces a robust majority-voting mechanism to determine translation scales and refines depth predictions through residual regression.
• Demonstrated robustness and superior performance in challenging scenarios with limited inference angles while maintaining computational efficiency.

5. Chapter 6: RSfM: Revisit Self-supervised Depth Estimation with Local Structure-from-Motion
• Introduced RSfM, extending the LightedDepth framework to local multi-view settings (3–9 frames). This method innovates by converting non-differentiable inlier counts into a differentiable manifold space using the Hough Transform, enhancing pose accuracy in scenarios with limited mutual visibility.
• Verified its effectiveness through experiments, showing improvements in pose accuracy, depth consistency, and correspondence estimation across benchmark datasets like ScanNet and KITTI360.

6. Chapter 7: Motion-from-Structure: Leveraging Monocular Depth Priors for Multi-View Tasks
• Developed Motion-from-Structure (MfS), which generalizes the inlier counting strategy to large-scale SfM. MfS introduces a robust scoring function based on an Area-Under-Curve (AUC) framework, improving optimization smoothness and reducing sensitivity to hyper-parameters.
• Demonstrated state-of-the-art performance in large-scale multi-view pose estimation and camera re-localization, particularly excelling in challenging scenarios involving noisy depth maps and limited texture or motion parallax.

Together, these contributions enhance structure and motion estimation from RGB image collections by bridging dense learning-based approaches with traditional geometric methods, leading to more accurate and scalable solutions.

CHAPTER 2

THE EDGE OF DEPTH: EXPLICIT CONSTRAINTS BETWEEN SEGMENTATION AND DEPTH

In this work we study the mutual benefits of two common computer vision tasks, self-supervised depth estimation and semantic segmentation from images. For example, to help unsupervised monocular depth estimation, constraints from semantic segmentation have been explored implicitly, such as by sharing and transforming features. In contrast, we propose to explicitly measure the border consistency between segmentation and depth and minimize it in a greedy manner by iteratively supervising the network towards a locally optimal solution. Partially, this is motivated by our observation that semantic segmentation, even trained with limited ground truth (200 images of KITTI), can offer more accurate borders than any (monocular or stereo) image-based depth estimation.
Through extensive experiments, our proposed approach advances the state of the art on unsupervised monocular depth estimation on the KITTI dataset.

2.1 Introduction

Estimating depth is a fundamental problem in computer vision with notable applications in self-driving [29] and virtual/augmented reality. To solve the challenge, a diverse set of sensors has been utilized, ranging from monocular cameras [87] and multi-view cameras [46] to depth completion from LiDAR [114]. Although the monocular system is the least expensive, it is the most challenging due to scale ambiguity. The current highest performing monocular methods [296, 97, 163, 135, 79] are reliant on supervised training, thus consuming large amounts of labelled depth data. Recently, self-supervised methods with photometric supervision have made significant progress by leveraging unlabeled stereo images [82, 87] or monocular videos [325, 260, 305] to approach comparable performance as the supervised methods.

Yet, self-supervised depth inference techniques suffer from high ambiguity and sensitivity in low-texture regions, reflective surfaces, and the presence of occlusion, likely leading to a sub-optimal solution. To reduce these effects, many works seek to incorporate constraints from external modalities. For example, prior works have explored leveraging diverse modalities such as optical flow [305], surface normal [297], and semantic segmentation [40, 269, 173, 320].

Figure 2.1 We explicitly regularize the depth border to be consistent with the segmentation border. A "better" depth I∗ is created through morphing according to distilled point pairs pq. By penalizing its difference with the original prediction I at each training step, we gradually achieve a more consistent border. The morph happens over every distilled pair, but only one pair is illustrated due to limited space.

Optical flow can be naturally linked to depth via ego-motion and object motion, while surface normal can be re-defined as the direction of the depth gradient in 3D. Comparatively, semantic segmentation is unique in that, though highly relevant, it is difficult to form a definite relationship with depth.

In response, prior works tend to model the relation of semantic segmentation and depth implicitly [40, 269, 173, 320]. For instance, [40, 269] show that jointly training a shared network with semantic segmentation and depth can help learn both modalities. [320] learns a transformation between semantic segmentation and depth feature spaces. Despite empirically positive results, such techniques lack a clear and detailed explanation for their improvement. Moreover, prior work has yet to explore the relationship from one of the most obvious aspects — the shared borders between segmentation and depth.

Hence, we aim to explicitly constrain monocular self-supervised depth estimation to be more consistent and aligned with its segmentation counterpart. We validate the intuition of segmentation being stronger than depth estimation for estimating object boundaries, even compared to depth from multi-view camera systems [304], thus demonstrating the importance of leveraging this strength (Tab. 2.3). We use the distance between segmentation and depth edges as a measurement of their consistency. Since this measurement is not differentiable, we can not directly optimize it as a loss. Rather, it is optimized as a "greedy search", such that we iteratively construct a locally optimal augmented disparity map under the proposed measurement and penalize its discrepancy with the original prediction.
The construction of the augmented depth map is done via a modified Beier–Neely morphing algorithm [256]. In this way, the estimated depth map gradually becomes more consistent with the segmentation edges within the scene, as demonstrated in Fig. 2.1. Since we use predicted semantic labels [333], noise is inevitably inherited. To combat this, we develop several techniques to stabilize training as well as improve performance. We also notice that recent stereo-based self-supervised methods ubiquitously possess "bleeding artifacts", which are fading borders around the two sides of objects. We trace the cause to occlusions in stereo cameras near object boundaries and resolve it by integrating a novel stereo occlusion mask into the loss, further enabling quality edges and subsequently facilitating our morphing technique. Our contributions can be summarized as follows:
⋄ We explicitly define and utilize the border constraint between semantic segmentation and depth estimation, resulting in depth more consistent with segmentation.
⋄ We alleviate the bleeding artifacts in prior depth methods [88, 87, 40, 191] via the proposed stereo occlusion mask, furthering the depth quality near object boundaries.
⋄ We advance the state-of-the-art (SOTA) performance of the self-supervised monocular depth estimation task on the KITTI dataset, which for the first time matches SOTA supervised performance in the absolute relative metric.

2.2 Related work

Self-supervised Depth Estimation Self-supervision has been a pivotal component in depth estimation [325, 260, 305]. Typically, such methods require only a monocular image in inference but are trained with video sequences, stereo images, or both. The key idea is to build pixel correspondences from a predicted depth map among images of different view angles and then minimize a photometric reconstruction loss for all paired pixels. Video-based methods [325, 260, 305] require both depth map estimation and ego-motion, while the stereo system [82, 87] requires a pair of images captured simultaneously by cameras with known relative placement, reformulating depth estimation into disparity estimation. We note the photometric loss is subject to two general issues: (1) When occlusions are present, via stereo cameras or dynamic scenes in video, an incorrect pixel correspondence can be made, yielding sub-optimal performance. (2) There exists ambiguity in low-texture or color-saturated areas such as sky, road, tree leaves, and windows, thereby receiving a weak supervision signal. We aim to address (1) by the proposed stereo occlusion masking, and (2) by leveraging additional explicit supervision from semantic segmentation.

Occlusion Problem Prior works in video-based depth estimation [88, 260, 117, 35] have begun to address the occlusion problem. [88] suppresses occlusions by selecting pixels with a minimum photometric loss in consecutive frames. Other works [260, 117] leverage optical flow to account for object and scene movement. In comparison, occlusion in stereo pairs has not received comparable attention in SOTA methods. Such occlusions often result in bleeding depth artifacts when (self-)supervised with a photometric loss. [87] partially relieves the bleeding artifacts via a left-right consistency term. Comparatively, [191, 296] incorporate a regularization on the depth magnitude to suppress the artifacts.
In our work, we propose an efficient occlusion masking based only on a single estimated disparity map, which significantly improves estimation convergence and quality around dynamic objects' borders (Sec. 2.3.2). Another positive side effect is improved edge maps, which facilitates our proposed semantic-depth edge consistency (Sec. 2.3.1).

Using Additional Modalities To address weak supervision in low-texture regions, prior work has begun incorporating modalities such as surface normal [297], semantic segmentation [194, 40, 269, 173], optical flow [260, 117], and stereo matching proxies [278, 247]. For instance, [297] constrains the estimated depth to be more consistent with predicted surface normals, while [278, 247] leverage proxy disparity labels produced by Semi-Global Matching (SGM) algorithms [107, 108], which serve as additional pseudo ground truth supervision.

Figure 2.2 Framework Overview. The blue box indicates input while the yellow box indicates the estimation. The encoder-decoder takes only a left image I to predict the corresponding disparity I_d̂, which will be converted to the depth map I_d. The prediction is supervised via a photometric reconstruction loss l_r, morph loss l_g, and stereo matching proxy loss l_p.

In our work, we provide a novel study focusing on constraints from the shared borders between segmentation and depth.

Using Semantic Segmentation for Depth The relationship between depth and semantic segmentation is fundamentally different from the aforementioned modalities. Specifically, semantic segmentation does not inherently hold a definite mathematical relationship with depth. In contrast, surface normal can be interpreted as the normalized depth gradient in 3D space; disparity possesses an inverse linear relationship with depth; and optical flow can be decomposed into object movement, ego-motion, and depth estimation. Due to the vague relationship between semantic segmentation and depth, prior work primarily uses it in an implicit manner. We classify the uses of segmentation for depth estimation into three categories. Firstly, share weights between semantics and depth branches as in [40, 269]. Secondly, mix semantics and depth features as in [269, 173, 320]. For instance, [269, 173] use a conditional random field to pass information between modalities. Thirdly, [124, 194] opt to model the statistical relationship between segmentation and depth. [124] specifically models the uncertainty of segmentation and depth to re-weight them in the loss function. Interestingly, no prior work has leveraged the border consistency that naturally exists between segmentation and depth. We emphasize that leveraging this observation has two difficulties. First, segmentation and depth only share partial borders. Secondly, formulating a differentiable function to link binarized borders to continuous semantic and depth predictions remains a challenge. Hence, designing novel approaches to address these challenges is our contribution to an explicit segmentation-depth constraint.

2.3 The Proposed Method

We observe that recent self-supervised depth estimation methods [278] preserve deteriorated object borders compared to semantic segmentation methods [333] (Tab. 2.3). This motivates us to explicitly use segmentation borders as a constraint in addition to the typical photometric loss. We propose an edge-edge consistency loss l_c (Sec. 2.3.1.1) between the depth map and the segmentation map.
However, as l_c is not differentiable, we circumvent it by constructing an optimized depth map I∗_d and penalizing its difference with the original prediction I_d (Sec. 2.3.3.1). This construction is accomplished via a novel morphing algorithm (Sec. 2.3.1.2). Additionally, we resolve bleeding artifacts (Sec. 2.3.2) for improved border quality and rectify batch normalization layer statistics via a finetuning strategy (Sec. 2.3.3.1). As in Fig. 2.2, our method consumes stereo image pairs and precomputed semantic labels [333] in training, while only requiring a monocular RGB image at inference. It predicts a disparity map I_d̂, which is then converted to a depth map I_d given baseline b and focal length f under the relationship I_d = f · b / I_d̂.

2.3.1 Explicit Depth-Segmentation Consistency

To explicitly encourage the estimated depth to agree with its segmentation counterpart on their edges, we propose two steps. We first extract matching edges from the segmentation I_s and the corresponding depth map I_d (Sec. 2.3.1.1). Using these pairs, we propose a continuous morphing function to warp all depth values in its inner-bounds (Sec. 2.3.1.2), such that depth edges are aligned to semantic edges while preserving the continuous integrity of the depth map.

2.3.1.1 Edge-Edge Consistency

In order to define the edge-edge consistency, we must first extract the edges from both the segmentation map I_s and depth map I_d. We define I_s as a binary foreground-background segmentation map, whereas the depth map I_d consists of continuous depth values. Let us denote an edge T as the set of pixel locations p such that:

T = { p : ‖ ∂I(p)/∂x ‖ > k_1 },    (2.1)

where ∂I(p)/∂x is the 2D image gradient at p and k_1 is a hyperparameter controlling the necessary gradient intensity to constitute an edge. In order to highlight clear borders on close-range objects, the depth edge T_d is extracted from the disparity map I_d̂ instead of I_d. Given an arbitrary segmentation edge point q ∈ T_s, we denote δ(q, T_d) as the distance between q and its closest point in the depth edge T_d:

δ(q, T_d) = min_{p ∈ T_d} ‖ p − q ‖.    (2.2)

Since the correspondence between segmentation and depth edges does not strictly follow a one-to-one mapping, we limit it to a predefined local range. We denote the valid set Γ of segmentation edge points q ∈ T_s such that:

Γ(T_s | T_d) = { q | ∀q ∈ T_s, δ(q, T_d) < k_2 },    (2.3)

where k_2 is a hyperparameter controlling the maximum distance allowed for association. For notation simplicity, we denote Γ_s^d = Γ(T_s | T_d). Then the consistency l_c between the segmentation T_s and depth T_d edges is:

l_c(Γ(T_s | T_d), T_d) = (1 / |Γ_s^d|) Σ_{q ∈ Γ_s^d} δ(q, T_d).    (2.4)

Due to the discretization used in extracting edges from I_s and I_d, it is difficult to directly optimize l_c(Γ_s^d, T_d). Thus, we propose a continuous morph function (φ and g in Sec. 2.3.1.2) to produce an augmented depth I∗_d, with a corresponding depth edge T∗_d, that minimizes:

l_c(Γ(T_s | T_d), T∗_d).    (2.5)

Note that the l_c loss is asymmetric. Since the segmentation edge is more reliable, we prefer to use l_c(Γ_s^d, T∗_d) rather than its inverse mapping direction l_c(Γ_d^s, T∗_s).
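The measurement in Eqs. 2.1–2.4 can be prototyped in a few lines. The sketch below is a simplified NumPy/SciPy illustration, assuming single-channel inputs and a plain gradient-magnitude edge detector; the default thresholds are placeholders rather than the exact training configuration.

```python
import numpy as np
from scipy.spatial import cKDTree

def extract_edges(img, k1):
    """Eq. 2.1: pixels whose image-gradient magnitude exceeds k1."""
    gy, gx = np.gradient(img.astype(np.float64))
    mag = np.sqrt(gx ** 2 + gy ** 2)
    return np.argwhere(mag > k1)              # (N, 2) array of (row, col)

def edge_consistency(seg, disp, k1_seg=0.5, k1_disp=0.11, k2=20):
    """Eqs. 2.2-2.4: mean distance from associated segmentation edge points
    to their nearest disparity edge points, plus the associated pairs."""
    T_s = extract_edges(seg, k1_seg)          # segmentation edges
    T_d = extract_edges(disp, k1_disp)        # disparity (depth) edges
    if len(T_s) == 0 or len(T_d) == 0:
        return 0.0, np.empty((0, 2)), np.empty((0, 2))
    # delta(q, T_d): distance to the closest depth edge point (Eq. 2.2).
    dist, idx = cKDTree(T_d).query(T_s)
    keep = dist < k2                          # valid set Gamma (Eq. 2.3)
    l_c = float(dist[keep].mean()) if keep.any() else 0.0
    # Associated (q, p) pairs consumed later by the morph (Eq. 2.6).
    return l_c, T_s[keep], T_d[idx[keep]]
```

The returned pairs are exactly the associations that drive the morphing step described next.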
Point o is controlled by term 𝑡 in the extended line of morph warps x around −→ qp. −→ qo to x∗ around 2.3.1.2 Depth Morphing In the definition of consistence measurement 𝑙𝑐 in Eq. (2.5), we acquire a set of associations between segmentation and depth border points. We denote this set as 𝛀: (cid:110) 𝛀 = p | argmin {p|p∈Td} ∥p − q∥ , q ∈ 𝚪d s (cid:111) . (2.6) Associations in 𝛀 imply depth edge p should be adjusted towards segmentation edge q to minimize consistence measurement 𝑙𝑐. This motivates us to design a local morph function 𝜙(·) which maps an arbitrary point x near a segmentation point q ∈ 𝚪d s and associated depth point p ∈ 𝛀 to: x∗ = 𝜙(x | q, p) = x + −→ qp − −−→ qx′, · 1 1 + 𝑡 (2.7) where hyperparameter 𝑡 controls sample space illustrated in Fig. 2.3, and x′ denotes the point projection of x onto −→ qp: x′ = q + (−→ qx · ˆqp) · ˆqp, (2.8) where ˆqp is the unit vector of the associated edge points. We illustrate a detailed example of 𝜙(·) in Fig. 2.3. To promote smooth and continuous morphing, we further define a more robust morph function 𝑔(·), applied to every pixel x ∈ I∗ d as a distance-weighted summation of all morphs 𝜙(·) for each 16 associated pair (q, p) ∈ (𝚪d s , 𝛀): 𝑔(x | q, p) = 𝑖=|𝛀| ∑︁ 𝑤(𝑑𝑖) (cid:205) 𝑗=|𝛀| 𝑗=0 𝑤(𝑑 𝑗 ) 𝑖=0 · ℎ(𝑑𝑖) · 𝜙(x | p𝑖, q𝑖), (2.9) where 𝑑𝑖 is the distance between x𝑖 and edge segments −−→ q𝑖p𝑖. ℎ(·) and 𝑤(·) are distance-based )𝑚4, and ℎ(𝑑𝑖) = Sigmoid(−𝑚1 · (𝑑𝑖 − 𝑚2)), where 𝑚1, 𝑚2, 𝑚3, 𝑚4 are predefined hyperparameters. 𝑤(·) is a relative weight compromising morphing among weighting functions: 𝑤(𝑑𝑖) = ( 1 𝑚3+𝑑𝑖 multiple pairs, while ℎ(·) acts as an absolute weight ensuring each pair only affects local area. Implementation wise, ℎ(·) makes pairs beyond ∼7 pixels negligible, facilitating 𝑔(x | q, p) linear computational complexity. In summary, 𝑔(x | q, p) can be viewed as a more general Beier–Neely [256] morph, due to inclusion of ℎ(·). We align depth map better to segmentation via applying 𝑔(·) morph to pixels of its disparity map x ∈ I∗ ˆd , creating a segmentation-augmented disparity map I∗ ˆd : I∗ ˆd (x) = I ˆd(𝑔(x | q, p)) ⊢ ∀(p, q) ∈ (𝛀, Γ), p = 𝜙(q). (2.10) Next we may transform the edge-to-edge consistency term 𝑙𝑐 into the minimization of difference between I ˆd and the segmentation-augmented I∗ ˆd as local minimum of 𝑙𝑐 under certain condition is in the supplementary material (Suppl.). , as detailed in Sec. 2.3.3.1. A concise proof of I∗ d 2.3.2 Stereo Occlusion Mask Bleeding artifacts are a common difficulty in self-supervised stereo methods [88, 87, 40, 191]. Specifically, bleeding artifacts refer to instances where the estimated depth on surrounding foreground objects wrongly expands outward to the background region. However, few works provide detailed analysis of its cause. We illustrate the effect and an overview of our stereo occlusion mask in Fig. 2.4. Let us define a point b ∈ Id near the boundary of an object and corresponding point b† ∈ I† d in the right stereo view. When point b† is occluded by a foreground point c† in the right stereo, a photometric loss will seek a similar non-occluded point in the right stereo, e.g., the objects’ left boundary a†, since no exact solution may exist for occluded pixels. Therefore, the 17 (a) (b) (c) Figure 2.4 (a) Overlays disparity estimation over the input image showing typical bleeding artifacts. (b) We denote the red object contour from the left view I and green object contour from the right view I†. 
2.3.2 Stereo Occlusion Mask

Bleeding artifacts are a common difficulty in self-supervised stereo methods [88, 87, 40, 191]. Specifically, bleeding artifacts refer to instances where the estimated depth of surrounding foreground objects wrongly expands outward into the background region. However, few works provide a detailed analysis of their cause. We illustrate the effect and an overview of our stereo occlusion mask in Fig. 2.4. Let us define a point b ∈ I_d near the boundary of an object and the corresponding point b† ∈ I†_d in the right stereo view. When point b† is occluded by a foreground point c† in the right stereo, a photometric loss will seek a similar non-occluded point in the right stereo, e.g., the object's left boundary a†, since no exact solution may exist for occluded pixels.

Figure 2.4 (a) Overlays the disparity estimation over the input image, showing typical bleeding artifacts. (b) We denote the red object contour from the left view I and the green object contour from the right view I†. Background point b is visible in the left view, yet its corresponding right point b† is occluded by an object point c†. Thus, this point is incorrectly supervised by the photometric loss l_r to look for the nearest background pixel (e.g., a†), leading to a bleeding artifact in (a). (c) We depict the occluded region detected via Eq. 2.11.

Therefore, the disparity value at point b will be d̂∗_b = ‖ →a†b ‖ = x_b − x_{a†}, where x is the horizontal location. Since the background is assumed to be farther away than foreground points, a false supervision generally has the quality that the occluded background disparity will be significantly larger than its (unknown) ground truth value. As b approaches a†, the effect is lessened, creating a fading effect. To alleviate the bleeding artifacts, we form an occlusion indicator matrix M such that M(x, y) = 1 if the pixel location (x, y) has possible occlusions in the stereo view. For instance, in the left stereo image, M is defined as:

M(x, y) = 1 if min_{i ∈ (0, W−x]} ( I_d̂(x + i, y) − I_d̂(x, y) − i ) ≥ k_3, and 0 otherwise,    (2.11)

where W denotes the predefined search width and k_3 is a threshold controlling the thickness of the mask. The disparity value in the left image represents the horizontal distance each pixel moves to the left. As the occlusion is due to pixels to its right, we intuitively perform our search in one direction. Additionally, we can view occlusion as occurring when neighbouring pixels on the right move too far left and cover the pixel itself. In this way, occlusion can be detected as min_{i ∈ (0, W−x]} ( I_d̂(x + i, y) − I_d̂(x, y) − i ) ≥ 0. Considering the bleeding artifacts in Fig. 2.4, we use k_3 to counter large incorrect disparity values of occluded background pixels. The regions indicated by M are then masked when computing a reconstruction loss (Sec. 2.3.3.1).
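A compact NumPy sketch of the occlusion test is given below. It follows the geometric reading described above, flagging a pixel when some pixel within the search width on its right re-projects onto or past it by at least the thickness threshold; the default search width is an assumption, and k3 corresponds to the threshold in Eq. 2.11.

```python
import numpy as np

def stereo_occlusion_mask(disp, search_width=64, k3=0.05):
    """Flag left-image pixels that are likely occluded in the right view.

    A pixel (x, y) is marked when a pixel i columns to its right has a
    disparity exceeding i + disp(x, y) by at least k3, i.e., it lands on
    or in front of (x, y) after the stereo warp.
    """
    H, W = disp.shape
    occluded = np.zeros((H, W), dtype=bool)
    for i in range(1, search_width + 1):
        # Compare each pixel with the pixel i columns to its right.
        covers = disp[:, i:] - disp[:, :W - i] - i >= k3
        occluded[:, :W - i] |= covers
    return occluded.astype(np.uint8)      # 1 = occluded; excluded from l_r
```

Because the mask depends only on the predicted left disparity, it adds negligible cost to each training step.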
Table 2.1 Depth Estimation Performance on the KITTI Stereo 2015 dataset eigen splits [71], capped at 80 meters. The Data column denotes: D for ground truth depth, D† for SLAM auxiliary data, D∗ for synthetic depth labels, S for stereo pairs, M for monocular video, C for segmentation labels, C† for predicted segmentation labels. PP denotes post-processing. Size refers to the model size in Mb, which can differ depending on the implementation language.
✓ H × W 256 × 512 256 × 512 Method Yang et al. Guo et al. Luo et al. Kuznietsov et al. Fu et al. Lee et al. Godard et al. Mehta et al. Poggi et al. Zhan et al. Luo et al. Pillai et al. Tosi et al. Chen et al. Godard et al. Cita. [296] [97] [163] [135] [79] [137] [87] [167] [193] [312] [161] [191] [247] [40] [88] [278] Watson et al. (ResNet18) ✓ ✓ [278] Watson et al. (ResNet50) ✓ ✓ PP Data D†S ✓ D∗DS D∗DS 192 × 640 crop DS D D S S ✓ S ✗ MS MS ✓ S ✓ S ✓ SC ✓ MS S SC† S SC† 187 × 621 385 × 513 crop 352 × 1,216 256 × 512 256 × 512 256 × 512 160 × 608 256 × 832 384 × 1,024 256 × 512 crop 256 × 512 320 × 1,024 320 × 1,024 320 × 1,024 320 × 1,024 320 × 1,024 Ours (ResNet18) Ours (ResNet50) Size (Mb) Abs Rel 0.097 0.097 0.094 0.113 0.099 0.091 0.138 0.128 0.126 0.135 0.128 0.112 0.111 0.118 0.104 0.099 0.097 0.096 0.091 - 79.5 1,562 324.8 399.7 563.4 382.5 - 954.3 - 160 - 511.0 - 59.4 59.4 59.4 138.6 138.6 Sq Rel RMSE RMSE log 0.734 0.653 0.626 0.741 0.593 0.555 1.186 1.019 0.961 1.132 0.935 0.875 0.867 0.905 0.775 0.723 0.675 0.710 0.646 0.187 0.170 0.177 0.189 0.161 0.174 0.234 0.227 0.220 0.229 0.209 0.207 0.199 0.211 0.191 0.187 0.180 0.185 0.177 4.442 4.170 4.252 4.621 3.714 4.033 5.650 5.403 5.205 5.585 5.011 4.958 4.714 5.096 4.562 4.445 4.350 4.393 4.244 δ < 1.25 0.888 0.889 0.891 0.862 0.897 0.904 0.813 0.827 0.835 0.820 0.831 0.852 0.864 0.839 0.878 0.886 0.890 0.890 0.898 δ < 1.25² 0.958 0.967 0.965 0.960 0.966 0.967 0.930 0.935 0.941 0.933 0.945 0.947 0.954 0.945 0.959 0.962 0.964 0.962 0.966 δ < 1.25³ 0.980 0.986 0.984 0.986 0.986 0.984 0.969 0.971 0.974 0.971 0.979 0.977 0.979 0.977 0.981 0.981 0.983 0.981 0.983

2.3.3 Network and Loss Functions

Our network is comprised of an encoder-decoder identical to the baseline [278]. It takes in a monocular RGB image and predicts the corresponding disparity map, which is later converted to a depth map under known camera parameters.

2.3.3.1 Loss Functions

The overall loss function is comprised of three terms:

l = l_r(I_d̂(x)) + λ_2 · l_g(I_d̂(x)) + λ_1 · l_p(I_d̂(x)),    (2.12)

where l_r denotes a photometric reconstruction loss, l_g a morphing loss, l_p a stereo proxy loss [278], and x are the non-occluded pixel locations, i.e., {x | M(x) = 0}. λ_1 and λ_2 are the weights of the terms. We emphasize that this exclusion will not prevent learning of object borders. E.g., in Fig. 2.4(c), although the pixel b on the cyclist's left border is occluded, the network can still learn to estimate depth from a visible and highly similar pixel a† in the stereo counterpart, as both the left and right view images are respectively fed into the encoder in training, similar to prior self-supervised works [278, 88]. Following [88], we define the l_r reconstruction loss as:

l_r(I_d̂(x)) = α · (1 − SSIM(I(x), Ĩ(x))) / 2 + (1 − α) · |I(x) − Ĩ(x)|,    (2.13)

which consists of a pixel-wise mix of SSIM [276] and L1 loss between an input left image I and the reconstructed left image Ĩ, which is re-sampled according to the predicted disparity I_d̂. The α is a weighting hyperparameter as in [87, 278].

Table 2.2 Edge vs. Off-edge Performance. We evaluate the depth performance for O (off edge), W (whole image), and N (near edge).
Method | Area | Abs Rel | Sq Rel | RMSE | RMSE log | δ < 1.25
Watson et al. [278] | O | 0.085 | 0.507 | 3.684 | 0.159 | 0.909
Watson et al. [278] | W | 0.096 | 0.712 | 4.403 | 0.185 | 0.890
Watson et al. [278] | N | 0.202 | 2.819 | 8.980 | 0.342 | 0.702
Ours (ResNet50) | O | 0.081 | 0.466 | 3.553 | 0.152 | 0.916
Ours (ResNet50) | W | 0.091 | 0.646 | 4.244 | 0.177 | 0.898
Ours (ResNet50) | N | 0.192 | 2.526 | 8.679 | 0.324 | 0.712
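For concreteness, a PyTorch sketch of Eq. 2.13 is shown below; the 3 × 3 average-pooling SSIM follows common self-supervised depth implementations and is an assumption, as is returning the loss as a per-pixel map rather than a scalar.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    """Simplified SSIM over 3x3 average-pooled windows."""
    mu_x = F.avg_pool2d(x, 3, 1, padding=1)
    mu_y = F.avg_pool2d(y, 3, 1, padding=1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, padding=1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, padding=1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, padding=1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2)
    return torch.clamp(num / den, 0, 1)

def photometric_loss(target, reconstructed, alpha=0.85):
    """Eq. 2.13: per-pixel mix of SSIM and L1 between I and its reconstruction."""
    l_ssim = (1 - ssim(target, reconstructed)) / 2
    l1 = (target - reconstructed).abs()
    return alpha * l_ssim + (1 - alpha) * l1      # per-pixel loss map
```

Keeping the loss per-pixel is what allows the gating in Eqs. 2.15 and 2.16 to compare the photometric error under different disparity candidates at each location.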
We minimize the distance between depth and segmentation edges by steering the disparity I_d̂ to approach the semantic-augmented disparity I∗_d̂ (Eq. 2.10) with a logistic loss:

l_g(I_d̂(x)) = w(I_d̂(x)) · log(1 + |I∗_d̂(x) − I_d̂(x)|),    (2.14)

where w(·) is a function to downweight image regions with low variance. It is observed that the magnitude of the photometric loss (Eq. 2.13) varies significantly between textureless and richly textured image regions, whereas the morph loss (Eq. 2.14) is primarily dominated by the border consistency. Moreover, the morph is itself dependent on an estimated semantic pseudo ground truth I_s [333], which may include noise. In consequence, we only apply the loss when the photometric loss is comparatively improved. Hence, we define the weighting function w(·) as:

w(I_d̂(x)) = Var(I)(x) if l_r(I∗_d̂(x)) < l_r(I_d̂(x)), and 0 otherwise,    (2.15)

where Var(I) computes the pixel-wise RGB image variance in a 3 × 3 local window. Note that when a noisy semantic estimation I_s causes l_r to degrade, the pixel location is ignored. Following [278], we incorporate a stereo proxy loss l_p, which we find helpful in neutralizing noise in the estimated semantic labels, defined similarly to Eq. 2.14 as:

l_p(I_d̂(x)) = log(1 + |I^p_d̂(x) − I_d̂(x)|) if l_r(I^p_d̂(x)) < l_r(I_d̂(x)), and 0 otherwise,    (2.16)

where I^p_d̂ denotes the stereo matched proxy label generated by the Semi-Global Matching (SGM) [107, 108] technique.

Figure 2.5 Left axis: Metric δ < 1.25 as a function of distance off segmentation edges in the background (−x) and foreground (+x), compared to [278]. Right axis: improvement distribution against distance. Our gain mainly comes from the near-edge background area but is not restricted to it.

Finetuning Loss: We further finetune the model to regularize the batch normalization [115] statistics to be more consistent with an identity transformation. As such, the prediction becomes less sensitive to the exponential moving average, following inspiration from [226], denoted as: l_bn = ‖ I_d̂(x) − I′_d̂(x) ‖_2, where I_d̂ and I′_d̂ denote the predicted disparity with and without batch normalization, respectively.

2.3.3.2 Implementation Details

We use PyTorch [187] for training, and the preprocessing techniques of [88]. To produce the stereo proxy labels, we follow [278]. Semantic segmentation is precomputed via [333] in an ensemble way with default settings at a resolution of 320 × 1,024. Using the semantics definition in Cityscapes [48], we set the object, vehicle, and human categories as foreground, and the rest as background. This allows us to convert a semantic segmentation mask to a binary segmentation mask I_s. We use a learning rate of 1e−4 and train the joint loss (Eq. 2.12) for 20 epochs, starting with ImageNet pretrained weights. After convergence, we apply the l_bn loss for 3 epochs at a learning rate of 1e−5. We set t = λ_1 = 1, λ_2 = 5, k_1 = 0.11, k_2 = 20, k_3 = 0.05, m_1 = 17, m_2 = 0.7, m_3 = 1.6, m_4 = 1.9, and α = 0.85.
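Putting Eqs. 2.12 and 2.14–2.16 together, a training step could assemble the loss roughly as follows. This is a sketch, not the released implementation: the tensor layout, the average-pooling variance, and the reduction to a scalar over non-occluded pixels are our assumptions, while the hyperparameter defaults mirror the values listed above.

```python
import torch
import torch.nn.functional as F

def local_variance(img, k=3):
    """Pixel-wise RGB variance in a k x k window, used by w(.) in Eq. 2.15."""
    mu = F.avg_pool2d(img, k, stride=1, padding=k // 2)
    ex2 = F.avg_pool2d(img * img, k, stride=1, padding=k // 2)
    return (ex2 - mu ** 2).mean(1, keepdim=True)

def total_loss(disp, disp_morphed, disp_proxy, image,
               l_r, l_r_morphed, l_r_proxy, occlusion_mask,
               lambda1=1.0, lambda2=5.0):
    """Eq. 2.12: photometric + morph + proxy terms on non-occluded pixels.

    l_r, l_r_morphed, l_r_proxy are per-pixel photometric loss maps evaluated
    with the predicted, morphed, and SGM-proxy disparities, respectively.
    """
    visible = (occlusion_mask == 0).float()

    # Morph loss (Eq. 2.14), applied only where the morph improves l_r (Eq. 2.15).
    w = local_variance(image) * (l_r_morphed < l_r).float()
    l_g = w * torch.log1p((disp_morphed - disp).abs())

    # Stereo proxy loss (Eq. 2.16), gated the same way.
    l_p = (l_r_proxy < l_r).float() * torch.log1p((disp_proxy - disp).abs())

    per_pixel = l_r + lambda2 * l_g + lambda1 * l_p
    return (per_pixel * visible).sum() / visible.sum().clamp(min=1.0)
```

The gating comparisons make both auxiliary terms self-correcting: wherever the morphed or proxy disparity would worsen the photometric reconstruction, its supervision is simply switched off.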
Figure 2.6 Input image and the disagreement of estimated disparity between our method and [278]. Our method impacts both the borders (←) and the inside (→) of objects.

2.4 Experiments

We first present the comprehensive comparison on the KITTI benchmark, then analyze our results, and finally ablate various design choices of the proposed method.

KITTI Dataset: We compare our method against SOTA works on the KITTI Stereo 2015 dataset, a comprehensive urban autonomous driving dataset providing stereo images with aligned LiDAR data. We utilize the eigen splits, evaluated with the standard seven KITTI metrics [71], with the crop of Garg [82] and a standard distance cap of 80 meters [87]. Readers can refer to [71] for an explanation of the used metrics.

Depth Estimation Performance: We show a comprehensive comparison of our method to the SOTA in Tab. 2.1. Our framework outperforms prior methods on each of the seven metrics. For a fair comparison, we utilize the same network structure as [88, 278]. We consider that approaching the performance of supervised methods is an important goal of self-supervised techniques. Notably, our method is the first self-supervised method matching SOTA supervised performance, as seen in the absolute relative metric in Tab. 2.1. Additionally, we emphasize our method improves the δ < 1.25 metric from 0.890 to 0.898, thereby reducing the gap between supervised and unsupervised methods by a relative ∼60% (= 1 − (0.904 − 0.898) / (0.904 − 0.890)).

Table 2.3 Comparison of algorithms when coupled with a segmentation network during inference. Given the segmentation predicted at inference, we apply the morph defined in Sec. 2.3.1.2 to the depth prediction. Improved metrics are marked with ↓ / ↑.
Category | Method | Morph | Abs Rel | Sq Rel | RMSE | RMSE log | δ < 1.25 | δ < 1.25² | δ < 1.25³
Unsupervised | Watson et al. [278] | ✗ | 0.097 | 0.734 | 4.454 | 0.187 | 0.889 | 0.961 | 0.981
Unsupervised | Watson et al. [278] | ✓ | 0.096 ↓ | 0.700 ↓ | 4.401 ↓ | 0.184 ↓ | 0.891 ↑ | 0.963 ↑ | 0.982 ↑
Supervised | Lee et al. [137] | ✗ | 0.088 | 0.490 | 3.677 | 0.168 | 0.913 | 0.969 | 0.984
Supervised | Lee et al. [137] | ✓ | 0.088 | 0.488 ↓ | 3.666 ↓ | 0.168 | 0.913 | 0.970 ↑ | 0.985 ↑
Stereo | Yin et al. [304] | ✗ | 0.049 | 0.366 | 3.283 | 0.153 | 0.948 | 0.971 | 0.983
Stereo | Yin et al. [304] | ✓ | 0.049 | 0.365 ↓ | 3.254 ↓ | 0.152 ↓ | 0.948 | 0.971 | 0.983

We further demonstrate a consistent performance gain with two variants of ResNet (Tab. 2.1), demonstrating our method's robustness to the backbone architecture capacity. We emphasize our contributions are orthogonal to most methods, including stereo and monocular training. For instance, we use noisy segmentation predictions, which can be further enhanced by pairing with stronger segmentation or via segmentation annotations. Moreover, recall that we do not use the monocular training strategy of [88] or additional stereo data such as Cityscapes, and we utilize a substantially smaller network (e.g., 138.6 vs. 563.4 MB [137]), thereby leaving more room for future enhancements.

Depth Performance Analysis: Our method aims to explicitly constrain the estimated depth edges to become similar to their segmentation counterparts. Yet, we observe that the improvements to the depth estimation, while being emphasized near edges, are distributed over more spatial regions. To understand this effect, we look at three perspectives. Firstly, we demonstrate that depth performance is the most challenging near edges using the δ < 1.25 metric. We consider a point x to be near an edge point p if it is below the averaged edge consistency l_c, that is, |x − p| ≤ 3. We demonstrate the depth performance of off-edge, whole image, and near-edge regions in Tab. 2.2. Although our method has superior performance on the whole, each method degrades near an edge (↓ ∼0.18 on δ from W to N), reaffirming the challenge of depth around object boundaries. Secondly, we compare metric δ < 1.25 against the baseline [278] in the left axes of Fig. 2.5. We observe improvement from the background around object borders (px ∼ −5) and from the foreground inside objects (px ≥ 30).
Sq Rel RMSE RMSE log 0.754 0.762 0.736 0.714 0.692 0.674 Morph Abs Rel 0.102 0.101 0.099 0.098 0.098 0.097 𝛿 < 1.252 0.962 0.962 0.963 0.964 0.963 0.964 𝛿 < 1.25 0.884 0.887 0.889 0.890 0.889 0.891 0.187 0.186 0.185 0.183 0.182 0.180 4.499 4.489 4.462 4.421 4.393 4.354 ✗ ✗ ✗ ✓ ✗ ✓ Baseline + M + 𝑙𝑔 + Finetune Figure 2.7 Compare the quality of estimated depth around foreground objects between [278] (top) and ours (bottom). objects (px ≥ 30). This is cross-validated in Fig. 2.6 which visualizes the disagreements between ours and baseline [278]. Our method impacts near the borders (←) as well as inside of objects (→) in Fig. 2.6. Thirdly, we view the improvement as a normalized probability distribution, as illustrated in right axes of Fig. 2.5. It peaks at around −5 px, which agrees with the visuals of Fig. 2.7 where originally the depth spills into the background but becomes close to object borders using ours. Still, the improvement is consistently positive and generalized to entire distance range. Such findings reaffirm that our improvement is both near and beyond the edges in a general manner. Depth Border Quality: We examine the quality of depth borders compared to the baseline [278], as in Fig. 2.7. The depth borders of our proposed method is significantly more aligned to object boundaries. We further show that for SOTA methods, even without training our models, applying our morphing step at inference leads to performance gain, when coupled with a segmentation network [333] (trained with only 200 domain images). As in Tab. 2.3, this trend holds for unsupervised, supervised, and multi-view depth inference systems, implying that typical depth 24 Figure 2.8 (a) input image and segmentation, (b-e) estimated depth (top) and with overlaid segmen- tation (bottom) for various ablation settings, as defined in Tab. 7.5. Model Finetune Abs Rel Sq Rel RMSE RMSE log 𝛿 < 1.25 Godard et al. [88] Watson et al. [278] ✗ ✓ ✗ ✓ 0.104 0.775 4.562 0.103 0.731 4.531 0.096 0.710 4.393 0.094 0.676 4.317 0.191 0.188 0.185 0.180 0.878 0.878 0.890 0.892 Table 2.5 Improvement after finetuning of different models. methods can struggle with borders, where our morphing can augment. However, we find that the inverse relationship using depth edges to morph segmentation is harmful to border quality. Stereo Occlusion Mask: To examine the effect of our proposed stereo occlusion masking (Sec. 2.3.2), we ablate its effects (Tab. 7.5). The stereo occlusion mask M improves the absolute relative error (0.102 → 0.101) and 𝛿 < 1.25 (0.884 → 0.887). Upon applying stereo occlusion mask during training, we observe the bleeding artifacts are significantly controlled as in Fig. 2.8 and in Suppl. Fig. 3. Hence, the resultant borders are stronger, further supporting the proposed consistency term 𝑙𝑐 and morphing operation. Morph Stabilization: We utilize estimated segmentation [333] to define the segmentation-depth edge morph. Such estimations inherently introduce noise and destablization in training for which we propose a w(x) weight to provide less attention to low image variance and ignore any regions which degrades photometric loss (Sec. 2.3.3.1). Additionally, we ablate the specific help from stereo proxy labels in stabilizing training in Fig. 2.8 (d) & (e) and Suppl. Fig. 3. Finetuning Strategy: To better understand the effect of our finetuning strategy (Sec. 2.3.3.1) on performance, we ablate using [88, 278] and our method, as shown in Tab. 7.5 and 2.5. 
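The near-edge analysis above (Tab. 2.2 and Fig. 2.5) can be reproduced with a short NumPy/SciPy sketch that restricts the δ < 1.25 accuracy to pixels within a given distance of the segmentation edges. The distance-transform formulation of the |x − p| ≤ 3 band is an assumption about the evaluation tooling, not the exact script used.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def delta_125(pred, gt, mask):
    # Standard KITTI accuracy metric: fraction of pixels with
    # max(pred / gt, gt / pred) < 1.25, restricted to `mask`.
    ratio = np.maximum(pred[mask] / gt[mask], gt[mask] / pred[mask])
    return (ratio < 1.25).mean()

def near_edge_delta(pred, gt, valid, seg_edges, radius=3):
    # seg_edges: boolean map of segmentation edge pixels.
    # Distance to the nearest edge pixel; |x - p| <= radius defines "near edge".
    dist = distance_transform_edt(~seg_edges)
    near = (dist <= radius) & valid
    far = (dist > radius) & valid
    return delta_125(pred, gt, near), delta_125(pred, gt, far)
```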
25 Figure 2.9 Comparison of depth of initial baseline (b), triangularization (c), and proposed morph (d). Method Sq Rel RMSE RMSE log 𝛿 < 1.25 Ours (Triangularization) 0.697 4.379 0.686 4.368 Ours (Proposed) 0.180 0.180 0.895 0.895 Table 2.6 Our morphing strategy versus triangularization. Each ablated method achieves better performance after applying the finetuning, suggesting the technique is general. Morphing Strategy: We explore the sensitivity of our morph operation (Sec. 2.3.1), by comparing its effectiveness against using triangularization to distill point pair relationships. We accomplish this by first forming a grid over the image using anchors. Then define corresponding triangularization pairs between the segmentation edge points paired with two anchors. Lastly, we compute an affine transformation between the two triangularizations. We analyze the technique vs. our proposed morphing strategy qualitatively in Fig. 2.9 and quantitatively in Tab. 2.6. Although the methods have subtle distinctions, the triangularization morph is generally inferior, as highlighted by the RMSE metrics in Tab. 2.6. Further, the triangularization morphing forms boundary errors with acute angles which introduce more noise in the supervision signal, as exemplified in Fig. 2.9. 2.5 Conclusions We present a depth estimation framework designed to explicitly consider the mutual benefits between two neighboring computer vision tasks of self-supervised depth estimation and semantic segmentation. Prior works have primarily considered this relationship implicitly. In contrast, we propose a morphing operation between the borders of the predicted segmentation and depth, then use this morphed result as an additional supervising signal. To help the edge-edge consistency quality, we identify the source problem of bleeding artifacts near object boundaries then propose 26 a stereo occlusion masking to alleviate it. Lastly, we propose a simple but effective finetuning strategy to further boost generalization performance. Collectively, our method advances the state of the art on self-supervised depth estimation, matching the capacity of supervised methods, and significantly improves the border quality of estimated depths. 27 CHAPTER 3 PMATCH: PAIRED MASKED IMAGE MODELING FOR DENSE GEOMETRIC MATCHING Dense geometric matching determines the dense pixel-wise correspondence between a source and support image corresponding to the same 3D structure. Prior works employ an encoder of transformer blocks to correlate the two-frame features. However, existing monocular pretraining tasks, e.g., image classification, and masked image modeling (MIM), can not pretrain the cross- frame module, yielding less optimal performance. To resolve this, we reformulate the MIM from reconstructing a single masked image to reconstructing a pair of masked images, enabling the pretraining of transformer module. Additionally, we incorporate a decoder into pretraining for improved upsampling results. Further, to be robust to the textureless area, we propose a novel cross-frame global matching module (CFGM). Since the most textureless area is planar surfaces, we propose a homography loss to further regularize its learning. Combined together, we achieve the State-of-The-Art (SoTA) performance on geometric matching. 3.1 Introduction When a 3D structure is viewed in both a source and a support image, for a pixel (or keypoint) in the source image, the task of geometric matching identifies its corresponding pixel in the support image. 
This task is a cornerstone for many downstream vision applications, e.g. homography estimation [65], structure-from-motion [209], visual odometry estimation [72] and visual camera localization [28]. There exist both sparse and dense methods for geometric matching. The sparse methods [67, 199, 255, 160, 63, 201, 152, 234, 234] only yield correspondence on sparse or semi-dense locations while the dense methods [252, 250, 68] estimate pixel-wise correspondence. They primarily differ in that the sparse methods embed a keypoint detection or a global matching on discrete coordinates, which underlyingly assumes a unique mapping between source and support frames. Yet, the exis- tence of textureless surfaces introduces multiple similar local patches, disabling keypoint detection or causing ambiguous matching results. Dense methods, though facing similar challenges at the 28 Figure 3.1 Most vision tasks start with a pretrained network. In geometric matching, the unique network components processing two-view features cannot benefit from the monocular pretraining task, e.g., image classification, and masked image modeling (MIM). As in the figure, this work enables the pretraining of a matching model via reformulating MIM from reconstructing a single masked image to reconstructing a pair of masked images. coarse level, alleviate it with the additional fine-level local context and smoothness constraint. Until recently, the dense methods demonstrate a comparable or better geometric matching performance over the sparse methods [252, 250, 68]. A relevant task to dense geometric matching is the optical flow estimation [241]. Both tasks estimate dense correspondences, whereas the optical flow is applied over consecutive frames with the constant brightness assumption. In geometric matching [234, 38], apart from the encoder encodes source and support frames into feature maps, there exist transformer blocks which correlate two-frame features, e.g., the LoFTR module [234]. Since these network components consume two-frame inputs, the monocular pretraining task, e.g., the image classification and masked image modeling (MIM) defined on ImageNet dataset, is unable to benefit the network. This limits both the geometric matching performance and its generalization capability. To address this, we reformulate the MIM from single masked image reconstruction to paired masked images reconstruction, i.e., pMIM. Paired MIM benefits the geometric matching as both tasks rely on the cross-frame module to correlate two frames inputs for prediction. With a pretrained encoder, the decoder in dense geometric matching is still randomly initialized. Following the idea of pretraining encoder, we extend pMIM pretraining to the decoder. As part functionality of decoder is to upsample the coarse-scale initial prediction to the same resolution as 29 input, we also task the decoder in pMIM to upsample the coarse-scale reconstruction to its original resolution. Correspondingly, we consist the decoder as stacks of the depth-wise convolution except for the last prediction head. With the depth-wise decoder, when transferring from pMIM to geo- metric matching, we duplicate the decoder along the channel dimension to finish the initialization. To this end, there exists only a small number of components in the decoder randomly initialized, we pretrain the rest network components using synthetic image pair augmentation [250]. To further improve the dense geometric matching performance, we propose a cross-frame global matching module (CFGM). 
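The channel duplication used to transfer the depth-wise decoder from single-frame pretraining to two-frame matching (detailed later in Fig. 3.3) can be sketched as follows; the layer construction and the restriction to a single convolution are illustrative assumptions.

```python
import torch
import torch.nn as nn

def make_depthwise(channels):
    # Depth-wise 3x3 convolution: one kernel per input channel (groups = C).
    return nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=channels)

def transfer_depthwise(pretrained: nn.Conv2d) -> nn.Conv2d:
    # Pretraining refiner R'_theta sees C channels (source feature only);
    # the finetuning refiner R_theta sees 2C channels (source + aligned support).
    # Duplicating the depth-wise kernels along the channel dimension finishes
    # the initialization, since each kernel still acts on one channel.
    c = pretrained.in_channels
    new = nn.Conv2d(2 * c, 2 * c, kernel_size=3, padding=1, groups=2 * c)
    with torch.no_grad():
        new.weight.copy_(pretrained.weight.repeat(2, 1, 1, 1))  # (2C, 1, 3, 3)
        new.bias.copy_(pretrained.bias.repeat(2))
    return new

# Example: a C = 64 depth-wise layer from pretraining initializes a 128-channel layer.
layer_2c = transfer_depthwise(make_depthwise(64))
```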
In CFGM, we first compute the correlation volume. We model the correspondences of coarse scale pixels as a summation over the discrete coordinates in the support frame, weighted by the softmaxed correlation vector. However, this modeling fails when multiple similar local patches exit. As a solution, we impose positional embeddings to the discrete coordinates and decode with a deep architecture to avoid ambiguity. Meanwhile, we notice that the textureless surfaces are mostly planar structures described by a low-dimensional 8 degree-of- freedom (DoF) homography matrix. We thus design a homography loss to augment the learning of the low DoF planar prior. We summarize our contributions as follows: • We introduce the paired masked image modeling pretext task, pretraining both the encoder and decoder of a dense geometric matching network. • We propose a novel cross-frame global matching module that is robust to textureless local patches. Since the most textureless patches are planar structures, we augment their learning with a homography loss. • We outperform dense and sparse geometric matching methods on diverse datasets. 3.2 Related works 3.2.1 Pretraining and Finetuning Pretraining and finetuning is an effective paradigm in vision tasks. Supervised image classifi- cation has been one of the most widely adopted pretraining methods. An encoder [104, 225, 112], e.g., ResNet [104], together with a few fully connected (FC) layers is trained for image classifica- 30 Figure 3.2 Methodology Overview. In (a), we illustrate the proposed dense geometric matching network. After extracting the multi-scale feature with the encoder 𝐸𝜃, we extend the LoFTR module with (1) Transformer blocks 𝑇𝜃 and (2) positional embeddings with an appended decoder 𝐷𝜃 to remove the ambiguity when multiple local patches exist. In (b), we show the proposed paired MIM pretext task. We apply image masking at the scale 𝑠 = 2, and recover the masked images with the transformer blocks. In (a), network 𝐷𝜃 (in red) is not included in pMIM pretraining. In dense matching, 𝑅𝜃 takes in the stack of source and the aligned support frame feature. In the pretext task, 𝑅′ 𝜃 only takes in the source frame feature. Thus, 𝑅′ 𝜃 is a sub-graph of 𝑅𝜃. We detail how to initialize 𝜃 in Fig. 3.3. The residual refinement at other scales repeats the process at scale 𝑠 = 8 𝑅𝜃 using 𝑅′ but consumes feature embeddings of other scales, skipped for simplicity. tion using a large-scale dataset, e.g., ImageNet [59]. After converging, the encoder is used as the initialization in the downstream vision tasks. Apart from supervised classification tasks, there are self-supervised methods producing dis- criminative feature representation. Inspired by BYOL [92], DINO [34] introduces a self-supervised mean-teacher knowledge distillation task. It encourages the prediction consistency between a stu- dent and teacher model where the teacher is an exponential moving average of the student model. The pretrained ViT model embeds explicit information of semantic segmentation, which is not observed in a supervised counterpart. Other self-supervised pretraining methods include color transformation [44], geometric transformation [44], Jigsaw Puzzle [171], feature frame predic- tion [185], etc. Among the self-supervised learning tasks, masked image modeling (MIM) [261, 290, 8, 324, 294, 103] achieves SoTA finetuning performance on ImageNet [59]. The task introduces Masked Language Modeling used in NLP domain to vision, reconstructing an image from its masked input. 
31 Figure 3.3 Resolution of the Discrepancy between 𝑅𝜃 and 𝑅′ 𝜃. We adopt stacks of the depth-wise convolution in the refinement module, i.e., each convolution kernel only works with one channel of the input feature maps. This makes refiner 𝑅′ 𝜃 in pretexting a sub-graph of refiner 𝑅𝜃 in finetuning. While transferring from the pretexting task to finetuning task, the input feature map concatenates , 𝑇 𝑠). As the bilinear sampling 𝑓 imposes minimal an extra aligned support frame feature 𝑓 (𝜑𝑠 2 distribution change, we duplicate the kernel weight along the channel dimension. While iGPT [39], ViT [64], and BEiT [8] adopt sophisticated paradigm in modeling, MAE [103] and SimMIM [292] show that directly regressing the masked continuous RGB pixels can achieve competitive results. Typically, they focus on pretraining the encoder, adopting an asymmetric design where only a shallow decoder head is appended. In this paper, we reformulate MIM from reconstructing a single image to the paired images, reducing the domain gap between the pretexting task and the downstream geometric matching. As a result, we extend the benefit of MIM pretraining to the task of dense geometric matching. 3.2.2 Sparse Geometric Matching There are detector-based and detector-free sparse geometric matching methods. Classic works are detector based, and employ the nearest neighbor (NN) match using the hand-crafted feature on detected keypoints, e.g., SIFT [160], SURF [14], and ORB [202]. Both keypoint detection and feature extraction are improved by data-driven deep models [63, 67, 184, 199, 302, 63]. Later, [204, 201, 255] propose to replace the naive NN match by graph neural network based differentiable matching. While the detector based methods operate on keypoints, the detector free methods, e.g. LoFTR [234] and ASpanFormer [38] operate all-to-all matching on coarse-scale discrete grid locations. Still, their matching depends on the correlation between features, yielding ambiguous results when mul- tiple local patches exist. We improve LoFTR from two perspectives. First, we extend the LoFTR module to the proposed cross-frame global matching module to benefit from the MIM pretexting 32 Figure 3.4 Visual Quality of the paired MIM pretext task. Visualized cases are from the MegaDepth and the ScanNet dataset. task. Second, we alleviate the ambiguity caused by similar local patches by imposing positional embeddings over the low-dimensional 2D coordinates. A decoder is then employed to resolve the ambiguity. 3.2.3 Dense Geometric Matching DGC-Net [168] regresses dense correspondences from a global correlation volume at a lim- ited resolution. GLU-Net [249] increases the resolution with a global-local correlation layer. GOCor [248] further improves GLU-Net [249] by replacing the correlation layer with online optimization. Other methods, such as RANSAC Flow [217], iteratively recover a homography transformation to reduce the visual difference between the source and support images. Though dense methods estimate more correspondences than sparse methods, it is less favored for geometric matching. Until recently, PDC Net+ [250] and DKM [68] close the gap between dense and sparse methods. Both methods model the dense match as probability functions. PDC Net+ adopts a mixture Laplacian distribution while DKM models with the Gaussian Process (GP). Furthermore, they estimate a confidence score to remove false positive results. We follow [250, 68] in the confidence estimation. 
However, instead of applying probabilistic regression, we keep the correlation based explicit matching process. This saves the computation of the inverse matrix required in the GP Regression of DKM. Also, we apply a unique architecture design to benefit from the MIM pretexting task. 33 Figure 3.5 Visual Quality of the Reconstruction. We visualize 4 reconstructed images using estimated dense correspondences. In each group, from left to right is the source image, support image, and the reconstructed image. The areas of low confidence are filled with white color. In ScanNet where the confidence groundtruth is not available, we use forward-backward flow consistency mask as a replacement. 3.3 Method In this section, we first introduce the proposed dense geometric matching method. Then we discuss how to pretext the network via the paired masked image modeling. Fig. 6.3 depicts our framework in finetuning and pretexting stages. 3.3.1 Dense Geometric Matching Dense geometric matching computes the dense correspondences between the source image I1 and support image I2. Under the estimated correspondences 𝑇, source image I1 can be recovered from support image I2 by applying bilinear sampling at 𝑇. Since the dense correspondences between I1 and I2 is not guaranteed to exist at each pixel location, we follow [68] in estimating confidence 𝑃 to indicate the fidelity of the prediction. Feature Extraction. As shown in Fig. 6.3, we adopt a multi-scale ResNet-based [104] feature extractor 𝐸𝜃. Taking the source frame I1 as an example, we produce the multiscale feature embeddings as: {𝜑𝑠=2 1 , 𝜑𝑠=4 1 , 𝜑𝑠=8 1 } = 𝐸𝜃 (I1). (3.1) For the input image I1 of resolution 𝐻 × 𝑊, the scale 𝑠 indicates a feature map of resolution 𝐻/𝑠 × 𝑊/𝑠. Cross-Frame Global Matching The cross-frame global matching module (CFGM) is designed to accomplish coarse-scale geometric matching. To benefit from the MIM pretext task, we first 34 process the scale 𝑠 = 8 feature map 𝜑𝑠=8 1 with the transformer block [123]: {𝜑𝑠=8 1 ′, 𝜑𝑠=8 2 ′ } = 𝑇𝜃 (𝜑𝑠=8 1 , 𝜑𝑠=8 2 ). (3.2) In the pretraining stage, the masked feature map is recovered by the appended transformer blocks. Then, we follow LoFTR [234] in using linear transformer blocks to correlate the source and support frame feature: {𝜑𝑠=8 1 , 𝜑𝑠=8 2 } = 𝐿𝜃 (𝜑𝑠=8 1 ′ ′, 𝜑𝑠=8 2 ). To compute the global matching results, we first compute the 4D correlation volume C R𝐻/8×𝑊/8×𝐻/8×𝑊/8, where: 𝐶𝑖 𝑗 𝑘𝑙 = ∑︁ ℎ 1 𝛾 (cid:17) (cid:16) 𝜑𝑠=8 1 𝑖 𝑗 ℎ (cid:17) (cid:16) · 𝜑𝑠=8 2 , 𝑘𝑙ℎ (3.3) (cid:16) 𝜑𝑠=8 1 , 𝜑𝑠=8 2 (cid:17) ∈ (3.4) where 𝛾 is a temperature scalar. The coarse matches are computed as a summation over pixel locations X ∈ R(𝐻/8)(𝑊/8)×2 weighted by the softmaxed correlation volume. That is, after the correlation volume C being reshaped to C ∈ R(𝐻/8)(𝑊/8)×(𝐻/8)(𝑊/8), we apply the softmax: (cid:102)𝐶𝑖 𝑗 = softmax(𝐶𝑖 𝑗 ). (3.5) Here, element 𝐶𝑖 𝑗 is a size (𝐻/8)(𝑊/8) × 1 vector. We conclude the coarse global matching results as: 𝑇 𝑠=8 ∗ = (cid:101)C × X. (3.6) Note, Eqn. 3.6 will cause ambiguous results when multiple similar textureless local patches exist, i.e., multiple peak values in softmaxed correlation vector (cid:102)𝐶𝑖 𝑗 . To resolve this, we modify Eqn. 3.6 with: 𝑇 𝑠=8 ∗ , 𝑃𝑠=8 ∗ = 𝐷𝜃 (cid:16) (cid:101)C × 𝑀 (X) (cid:17) , (3.7) where 𝑀 (X) is cosine positional embeddings with learnable tokens [234, 68], projecting the 2D pixel locations to a high dimensional space to avoid ambiguity when multiple similar patches exist. 
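A minimal sketch of the coarse matching in Eqs. (3.3)–(3.7), assuming s = 8 feature maps of shape (B, C, H/8, W/8); the einsum-based correlation and the callable placeholder for the positional embedding M(·) are illustrative rather than the exact implementation.

```python
import torch

def coarse_match(feat1, feat2, embed, gamma=0.1):
    # feat1, feat2: (B, C, h, w) source / support features at scale s = 8.
    # embed: callable M(.) lifting 2D coordinates to a higher-dim embedding (Eq. 3.7).
    B, C, h, w = feat1.shape
    f1 = feat1.flatten(2).transpose(1, 2)                  # (B, h*w, C)
    f2 = feat2.flatten(2).transpose(1, 2)                  # (B, h*w, C)
    corr = torch.einsum('bic,bjc->bij', f1, f2) / gamma    # Eq. (3.3)
    corr = corr.softmax(dim=-1)                            # Eq. (3.5), over support pixels

    ys, xs = torch.meshgrid(torch.arange(h, device=feat1.device),
                            torch.arange(w, device=feat1.device), indexing='ij')
    coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2).float()  # X: (h*w, 2)

    naive = corr @ coords           # Eq. (3.6): soft-argmax, ambiguous for repeated patches
    lifted = corr @ embed(coords)   # Eq. (3.7): aggregate embeddings M(X) instead
    return naive, lifted            # `lifted` is what the decoder D_theta consumes
```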
The decoder 𝐷𝜃 decodes 𝑇 𝑠=8 ∗ , initial correspondences estimation at scale 𝑠 = 8, and 𝑃𝑠=8 ∗ , initial confidence estimation. 35 Methods Venue Dense Match PCK ↑ Run- RANSAC-FLow [217] PDC-Net [314] PDC-Net+ [250] LIFE [113] ECCV’20 CVPR’21 Arxiv’21 Arxiv’21 GLU-Net-GOCor [248] NeurIPS’20 CVPR’21 Arxiv’21 CVPR’23 PDC-Net [314] PDC-Net+ [250] PMatch (Ours) @1 px @3 px @5 px time (ms) 53.47 71.81 74.51 39.98 57.77 68.95 72.41 79.83 3, 596 1, 017 1, 017 78 71 88 88 124 83.45 89.36 90.69 76.14 78.61 84.07 86.70 95.18 86.81 91.18 92.10 83.14 82.24 85.72 88.12 96.52 Table 3.1 MegaDepth Dense Geometric Matching. The running time of all methods is measured at the resolution 480 × 480. The upper and lower groups are methods running multiple or single times. [Key: Red color marks Best, Blue color marks the Second Best] Multi-Scale Refinement We follow [68] in using the multi-scale refinement module: Δ𝑇 𝑠, Δ𝑃𝑠 = 𝑅𝜃 (𝜑𝑠 1 , 𝑓 (𝜑𝑠 2 , 𝑇 𝑠)), (3.8) where function 𝑓 (·) indicates the bilinear interpolation to align the support frame feature using the current estimated correspondences 𝑇 𝑠, shown in Fig. 6.3. To accommodate the transfer between pretexting and finetuning stage, we apply depth-wise convolution [68] in 𝑅𝜃. We detail the discussion in Fig. 3.3 and Sec.3.3.2. The correspondences and confidence on the next scale are initialized with the bilinear upsampling. 3.3.2 Paired MIM Pretraining Paired Masked Image Modeling (MIM) MIM is extensively adopted in image classification task [103, 292]. An image classification network can be further improved after MIM pretexting. As shown in Fig. 7.1 and 3.4, the network reconstructs the input from randomly masked feature embeddings at a specific scale. In this work, we investigate the benefit of pretraining both the encoder and decoder under MIM. Compared to only pretraining the encoder, pretraining the whole network further reduces the domain gap between pretexting and finetuning tasks. Masking Strategy We follow SimMIM [292] in using randomly selected 32 × 32 mask patches with a predefined masking ratio 𝑟1 and 𝑟2 for source and support frames. For source view, given the feature embeddings 𝜑𝑠=2 output by the extractor 𝐸𝜃 at scale 𝑠 = 2, we apply the randomly generated 1 mask w to mask out the feature embeddings, i.e.: ′ 𝜑𝑠=2 1 = 𝜑𝑠=2 1 ∗ (1 − w) + x ∗ w, (3.9) 36 Category Methods Venue Sparse Sparse W/ Detector CVPR’19 SuperGlue [204] Pattern’20 SGMNet [144] ICASSP’22 DRC-Net [152] CVPR’21 LoFTR [234] ICLR’22 QuadTree [239] Wo/ Detector MatchFormer [271] ACCV’22 ECCV’22 ASpanFormer [38] Arxiv’19 PDC-Net+ [250] CVPR’23 DKM [68] CVPR’23 PMatch (Ours) Dense Pose Estimation AUC ↑ @5◦ @10◦ @20◦ 75.9 61.2 42.2 72.6 59.0 40.5 58.3 42.9 27.0 81.2 69.2 52.8 82.2 70.5 54.6 81.8 69.7 53.3 83.1 71.5 55.3 76.1 61.9 43.1 85.1 74.9 60.5 85.7 75.7 61.4 Table 3.2 MegaDepth Two-View Camera Pose Estimation. We compare three groups of methods following SuperGlue [204] in evaluation. The pose AUC error is reported. Our method shows substantial improvement. [Key: Red color marks Best, Blue color marks the Second Best] where x is the learnable mask tokens. Note, our extractor 𝐸𝜃 starts from a 3 × 3 convolution kernel to avoid leakage of the masked patches. Prediction Heads Different from SimMIM [292], our prediction heads include most network components of the decoder. We complete the masked feature embeddings with the transformer as: ′ 𝜑𝑠=8 1 = 𝑇𝜃 (𝜑𝑠=8 1 ). (3.10) Here, we use the same notation as Eqn. 3.2 since both indicate image features at the scale 𝑠 = 8. 
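For reference, the masking step of Eq. (3.9) can be sketched as below; the 16 × 16 patch size at scale s = 2 (corresponding to 32 × 32 patches in image space) and the nearest-neighbor upsampling of the patch mask are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def mask_features(feat, mask_token, ratio=0.75, patch=16):
    # feat: (B, C, H/2, W/2) feature map from the extractor at scale s = 2.
    # mask_token: learnable (C,) token x in Eq. (3.9).
    # ratio: masking ratio r (initialized to 75% for both views in the thesis).
    B, C, H, W = feat.shape
    gh, gw = H // patch, W // patch
    w = (torch.rand(B, 1, gh, gw, device=feat.device) < ratio).float()
    w = F.interpolate(w, scale_factor=patch, mode='nearest')  # per-patch mask map
    # Eq. (3.9): phi' = phi * (1 - w) + x * w
    return feat * (1 - w) + mask_token.view(1, C, 1, 1) * w
```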
Note that the subsequent network component LoFTR is a series of linear transformer blocks [123] which reduce the quadratic computational complexity to linear. However, empirically we find the linear transformer poorly recovers the masked patches. We thus append the transformer blocks. As shown in Fig. 6.3, after Eqn. 3.10, we feed the completed feature map to CFGM. Note the refiner between the two stages is different. Instead of taking a stacked feature map (Eqn. 3.8), in pretexting we only take in a single feature map: ΔI𝑠 1 = 𝑅′ 𝜃 (𝜑𝑠 1), ΔI𝑠 2 = 𝑅′ 𝜃 (𝜑𝑠 2). (3.11) To account for the difference between Eqn. 3.8 and Eqn. 3.11, we apply depth-wise convolution, where each convolution kernel operates on one channel of the feature map, shown in Fig. 3.3. Since 𝑓 (𝜑𝑠 2 , 𝑇 𝑠) in Eqn. 3.8 is a resampled support frame feature, it imposes minimal distribution difference to 𝜑𝑠 2. Then, while transferring from the pretexting task to the downstream task, we only need to duplicate the channel of 𝑅𝜃 to complete the initialization. We follow SimMIM [292] 37 (a) Source Frame I1 (b) Supp. Frame I2 (c) PMatch (Ours) (d) DKM [68] (e) LoFTR [234] Figure 3.6 Visual Comparisons. We conduct the visual comparison against the SoTA dense [68] and sparse [234] methods on the MegaDepth and the ScanNet datasets. The color from blue to red indicates an increment in the end-point-error (L2 error). in estimating full resolution residual RGB images in each scale of the decoder. We visualize the reconstructed paired masked images in Fig. 3.4. Network Components not included in pMIM Since the feature map at 𝑠 = 2 contains little information about masked patches, the pretraining only includes refinement modules at scale 𝑠 = 4 and 𝑠 = 8. Furthermore, the CFGM decoder 𝐷𝜃 and part of 𝑅𝜃 are not included. We pretrain the rest network component with synthetic image pairs [250]. Prediction Objective Set the accumulated reconstruction at each scale 𝑠 as I𝑠, we regress the raw pixel value with an 𝑙1 loss: L𝑀 = ∑︁ 𝑠 1 𝑁 where 𝑁 is the number of unmasked pixels. 3.3.3 Dense Geometric Matching Loss (|I𝑠 1 − I1|1 + |I𝑠 2 − I2|1), (3.12) Homography Loss The image correspondences between two planar structures are constrained by a 3 × 3 homography matrix H with 8 DoF. Compared to correspondences estimation over arbitrary shapes, the correspondences in planar structures possess a lower rank. Given a surface normal n 38 computed using the depth gradient [177], the homography of the pixel can be computed as: H = ⊺ h 1 ⊺ h 2 ⊺ h 3                     = K1 (cid:18) R + (cid:19) t⊤ 𝑑 n K−1 2 , (3.13) where the K1 and K2 are intrinsic matrices of I1 and I2, R and t are camera rotation and translation, and 𝑑 is the pixel depth. We randomly sample 𝐾 anchor points {p𝑚 | 1 ≤ 𝑚 ≤ 𝐾 }. For each anchor point p𝑚, we sample 𝐾 candidate points {q𝑚 𝑛 | 1 ≤ 𝑛 ≤ 𝐾 }. We determine a co-planar indicator matrix O+ of size 𝐾 × 𝐾 to suggest all co-planar pairs. We use the normal consistency, point- to-plane distance, and homography consistency to compute the co-planar groundtruth, detailed in Supp. Finally, we apply a gradient-based penalty, penalizing the correspondences difference between the estimation and the groundtruth. L𝑠 ℎ = 1 |O+| ∑︁ (cid:16) 𝑇 𝑠 p − 𝑇 𝑠 q (cid:17) | − (cid:16) 𝑇 (cid:17) 𝑠 p − 𝑇 𝑠 q |1. 
O+ p,q=1 (3.14) Global Matching Loss Following [234], we minimize a binary cross-entropy loss over the corre- lation volume C after a dual-softmax operation: ′ (cid:157)𝐶𝑖 𝑗 𝑘𝑙 = softmax(𝐶𝑖 𝑗 ) · softmax(𝐶𝑘𝑙), where 𝐶𝑖 𝑗 and 𝐶𝑘𝑙 are (𝐻/8)(𝑊/8) × 1 vectors. The loss is defined as: L𝑔 = − 1 |M+| − 1 |M−| ∑︁ ′ log(cid:157)𝐶𝑖 𝑗 𝑘𝑙 𝑖 𝑗 𝑘𝑙∈M+ ∑︁ 𝑖 𝑗 𝑘𝑙∈M − (cid:16) log 1 − (cid:157)𝐶𝑖 𝑗 𝑘𝑙 ′(cid:17) , (3.15) (3.16) where M+ and M− are groundtruth indicator matrix of size 𝐻 × 𝑊 × 𝐻 × 𝑊 indicating whether a source frame pixel (𝑖, 𝑗) pairs with a target frame pixel (𝑘, 𝑙). Refinement Loss Following [68], we supervise both correspondences and confidence on each scale of the predictions, L𝑠 𝑟 = 1 |𝑃+| ∑︁ 𝑖 𝑗 ∈𝑃+ (cid:12) 𝑇 𝑠 𝑖 𝑗 − 𝑇 (cid:12) (cid:12) 𝑠 𝑖 𝑗 , (cid:12) (cid:12) (cid:12)2 39 (3.17) Category Methods Venue Sparse Sparse W/ Detector CVPR’19 SuperGlue [204] PR’20 SGMNet [144] ICASSP’22 DRC-Net [152] CVPR’21 LoFTR [234] ICLR’22 QuadTree [239] Wo/ Detector MatchFormer [271] ACCV’22 ECCV’22 ASpanFormer [38] Arxiv’19 PDC-Net+ [250] CVPR’23 DKM [68] CVPR’23 PMatch (Ours) Dense Pose Estimation AUC ↑ @5◦ @10◦ @20◦ 51.8 33.8 16.2 48.3 32.1 15.4 30.5 17.9 7.7 57.6 40.8 22.0 61.8 44.7 24.9 61.4 43.9 24.3 63.3 46.0 25.6 57.1 39.4 20.2 68.3 50.7 29.4 67.4 50.1 29.4 Table 3.3 ScanNet Two-View Camera Pose Estimation. We follow SuperGlue [204] in the testing protocol. The pose AUC error is reported. Our method achieves clear improvement over other baselines. [Key: Red color marks Best, Blue color marks the Second Best] where 𝑃+ 𝑖 𝑗 is a 𝐻 × 𝑊 matrix that indicates whether a valid pair is found at pixel location 𝑖 𝑗 in the source frame. Similarly, the loss of confidence is defined as: L𝑠 𝑐 = − 1 |P+| ∑︁ 𝑖 𝑗 ∈𝑃+ log(𝑃𝑖 𝑗 ) − 1 |P−| ∑︁ 𝑖 𝑗 ∈𝑃− log(1 − 𝑃𝑖 𝑗 ). (3.18) Total Loss The total loss is a weighted summation of proposed losses: L = 1 4 ∑︁ 𝑠 (𝐿𝑠 𝑟 + 𝑤𝑐L𝑠 𝑐) + 𝑤𝑔 · L𝑔 + 1 4 ∑︁ 𝑤ℎ L𝑠 ℎ. 𝑠 (3.19) The constant 4 comes from the four scales 𝑠 = {1, 2, 4, 8} set in our paper. 3.4 Experiments We first compare with other SoTA dense matching methods on the MegaDepth dataset. Then, to comprehensively reflect the contributions from both the density and accuracy of geometric matching, we follow [234, 68] in using the two-view relative camera pose estimation performance as the metric. We report on both the outdoor scenario MegaDepth [146] dataset and the indoor scenario ScanNet [52] dataset. We additionally evaluate on the HPatches [7] and the YFCC100m [243] datasets to demonstrate the generalizability of the model. 3.4.1 Implementation Details Pretext stage From DeMoN [258], BlendedMVS [300], HyperSim [200], ARKitScenes [13], and TartanAir [275] datasets, we collect a pretraining dataset of 1, 281, 167 image pairs, i.e., the same size as ImageNet [59]. Each pair is collected with a fixed frame index interval. In the pretraining 40 Methods Venue RANSAC-Flow [217] ECCV’20 CVPR’21 Arxiv’21 ICCV’19 CVPR’21 CVPR’21 Arxiv’21 ECCV’22 CVPR’23 PDC-Net [252] PDC-Net+ [250] OANet [54] CoAM [282] PDC-Net [252] PDC-Net+ [250] ASpanFormer [38] PMatch (Ours) Pose Estimation AUC ↑ Pose Estimation mAP ↑ @5◦ @10◦ @20◦ @5◦ @10◦ @20◦ 81.6 81.2 84.6 - - 80.3 82.7 - 89.3 73.3 73.0 76.6 - 66.8 70.9 73.8 - 83.1 64.9 63.9 67.4 52.2 55.6 60.5 63.9 - 75.9 - 55.8 58.1 - - 52.6 55.4 63.8 65.2 - 35.7 37.5 - - 32.2 34.8 44.5 45.7 - 72.3 74.5 - - 70.1 72.6 78.4 79.8 Table 3.4 YFCC100m Two-View Camera Pose Estimation. The upper group runs multiple times, while the lower group runs a single time. 
We follow [314] in the evaluation and preprocessing, reporting both pose AUC and mAP errors. [Key: Red color marks Best, Blue color marks the Second Best] dataset, we train the model using a batchsize of 128 under the resolution 192 × 256. We use the Adam optimizer [127] with a learning rate 2𝑒−4, running for 250k steps on 2× A100 GPUs. We stack 1 transformer layer. We initialize the masking ratio 𝑟1 = 75% and 𝑟2 = 75%. The masking operation applies to the ResNet, causing significantly different batch statistics between masked and unmasked inputs. Since the downstream task takes the unmasked image, we linearly reduce the support frame masking ratio 𝑟2 to 0 and use a different batch normalization layer for support view, resolving the batch statistics difference. We also apply the synthetic image pair augmentation introduced in [250]. Finetuning stage Our model trains with a batchsize of 16 at the resolution 544 × 720. The learning rate is set to 4𝑒−4, running 250k steps with a warmup of 25k steps. On 4× A100 GPUs, we train for 5 days with the Adam optimizer. We follow [234] in sampling the paired images, weighted by the sequence length and overlap ratio. The softmax temperature 𝛾 is 0.1. We set loss weight 𝑤𝑔 to 0.7 and 𝑤ℎ to 0.02. We sample 600 × 600 points for homography loss 𝐿 ℎ. 3.4.2 Datasets MegaDepth MegaDepth [146] collects over 10 thousand images of worldwide landmarks from the Internet. The collected images are processed by COLMAP [209] to produce groundtruth poses and depthmaps. The dataset collects images of significant visual contrast due to lighting conditions, view angles, and imaging devices. This imposes challenges to geometric matching. 41 ScanNet [52] is a large-scale indoor dataset with 1, 613 videos captured by RGB-D cameras. There are challenging textureless indoor scenes for geometric matching. YFCC100m [243] is a large multi-media dataset. A subset of 72 reconstructions of tourist landmarks is generated with groundtruth poses and depthmap. Hpatches provides the pair of one source and five support images taken under different view angles and lighting conditions with groundtruth homography transformation. 3.4.3 Dense Geometric Matching We follow the RANSAC-Flow [217] in training and testing split on the MegaDepth dataset. The PCK scores in Tab. 3.1 refer to the thresholded keypoints accuracy. We divide the baseline methods into single and multiple run methods. Note, the baseline methods PDC Net [252] and PDC Net+ [250] consume the additional synthetic data generated using COCO [149] instance segmentation label. For PCK @1px, we outperform the SoTA single and multiple run methods by an absolute margin of 4.89% and 6.99% respectively. Meanwhile, we are about 8× faster than SoTA baselines while suppassing SoTA performance. 3.4.4 Two-View Camera Pose Estimation Evaluation Protocol In the MegaDepth, ScanNet, and Hpatches datasets, we follow the evaluation protocol of [204, 234, 68] in reporting the pose accuracy AUC curve thresholded at 5, 10, and 20 degrees. In the YFCC100m dataset, we follow the protocol of RANSAC-Flow [217], additionally reporting the pose mAP value. The pose estimation is considered an outlier if its maximum degree error of translation or rotation exceeds the threshold. The two-view relative pose is estimated using the five-point algorithm [181] with RANSAC [62] via the OpenCV implementation [27]. 
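The evaluation protocol above maps directly onto OpenCV. A minimal sketch, assuming pts1 and pts2 are matched pixel coordinates of shape (N, 2) filtered by the predicted confidence, K is the 3 × 3 intrinsic matrix, and the RANSAC threshold is a placeholder value:

```python
import cv2
import numpy as np

def two_view_pose(pts1, pts2, K, thresh_px=0.5):
    # Five-point algorithm [181] inside RANSAC [62], OpenCV implementation [27].
    E, inliers = cv2.findEssentialMat(pts1, pts2, K,
                                      method=cv2.RANSAC,
                                      prob=0.99999,
                                      threshold=thresh_px)
    # recoverPose performs the cheirality check to pick the valid (R, t)
    # among the four decompositions of the essential matrix.
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
    return R, t, inliers
```

The rotation and translation errors are then the angular deviations of R and t from the groundtruth, and the pose AUC / mAP follow from thresholding the maximum of the two.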
Baseline Methods We compare with three groups of the methods, i.e., sparse methods with detector [204, 144], sparse methods without detector [152, 234, 239, 271, 38] and dense meth- ods [250, 68, 217, 252, 54, 282]. For sparse detector based methods, we use SuperPoint [63] as the keypoint detector. For dense methods, we further categorize them into single-run and multiple-run methods. For multiple-run methods, e.g., RANSAC-Flow [217], it repeats the prediction while reducing the visual difference with an estimated homography transformation. Among baselines, 42 AspanFormer [38] is a recent publicly available sparse detector-free method, improving LofTR with a sophisticated attention mechanism. Outdoor Dataset We test our method on the outdoor dataset MegaDepth. We follow the training and validation split of [204, 234, 68]. The evaluation split contains 1, 500 paired images randomly selected from the scene 0015 and 0022. As shown in Tab. 3.2, we achieve an absolute improvement of 0.9% over the recent SoTA dense method DKM [68]. Compared to the SoTA sparse method ASpanFormer [38], we maintain an improvement of 6.1%. Indoor Dataset We test our method on the indoor dataset ScanNet. We follow [68] in training and testing protocol, resizing images to 480 × 640. The validation split of ScanNet consists of 1, 500 image pairs [204]. In Tab. 7.3, we maintain competitive performance with the SoTA dense method DKM [68] and outperform SoTA sparse method by 1.4%. Generalization to YFCC100m We use the MegaDepth trained model to test on YFCC100m [243] dataset. We follow the preprocessing steps of [314], evaluated on 4 scenes with a total of 1, 000 images. During the evaluation, we resample the input images of the shorter side to 480. Tab. 3.4 shows that our method can achieve a superior generalization ability, maintaining an improvement of 1.2% over SoTA sparse methods [38]. Generalization to HPatches Following LoFTR [234], we test the MegaDepth dataset trained model on HPatches. In evaluation, the homography matrix is estimated using OpenCV’s implementation. We compare correspondences accuracy computed using the groundtruth and estimated homography. The image pairs in HPatches have lighting differences or view differences. The pattern is different from the training dataset MegaDepth. Under the unseen testing scenario, our model generalizes best among baselines. 3.5 Ablation Study Qualitative Comparison The visual quality of reconstructed images using the predicted corre- spondences is visualized in Fig. 3.5. We conduct a visual comparison with other SoTA dense and sparse methods in Fig. 3.6. In Row 1, (c), and (d), compared to DKM [68], the proposed CFGM module achieves correct initial correspondences. In Row 1, (c), and (e), compared to LoFTR [234], 43 Category Methods Venue Sparse W/ Detector Sparse Wo/ Detector Dense D2Net [67] R2D2 [199] DISK [255] SuperGlue NCNet [201] DRC-Net [152] LoFTR [234] DKM [68] PMatch (Ours) CVPR’19 NeurIPS’19 NeurIPS’20 CVPR’19 ECCV’20 ICASSP’22 CVPR’21 CVPR’23 CVPR’23 Pose Estimation AUC ↑ @3px @5px @10px 53.6 35.9 23.2 76.8 63.9 50.6 78.9 64.9 52.3 81.7 68.3 53.9 67.1 54.2 48.9 68.3 56.2 50.6 75.6 65.9 84.6 88.5 80.6 71.3 88.5 80.7 71.9 Table 3.5 Hpatches Homography Estimation. We follow [234] in evaluation protocol. We report the corner point AUC error under the estimated homography matrix. 
[Key: Red color marks Best, Blue color marks the Second Best] Baseline CFGM 𝐿𝐻 pMIM Encoder (𝐸𝜃, 𝑇𝜃, 𝐿𝜃) ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ (𝑅𝜃) pMIM Decoder Pose Estimation AUC ↑ @5◦ @10◦ @20◦ 83.0 71.5 56.1 83.9 72.6 57.5 84.1 72.9 57.9 85.3 75.0 60.6 85.7 75.7 61.4 ✓ Table 3.6 Ablation Studies on MegaDepth. The baseline method is the network in Fig. 6.3 with only a LoFTR module, i.e., without the other components of CFGM. The ablation is conducted under the same training and testing resolution as Tab. 3.2. Bold marks best. multi-scale dense refinement improves fine-scale correspondence accuracy. In Row 2, (c), (d), and (e), our CFGM and homography loss achieve accurate correspondence estimation on textureless planar surface, e.g., the black wall behind the sofa. Running Time Evaluated on an RTX 2080 Ti GPU, we run 160 ms for an image of 480 × 640 while LoFTR [234] runs 116 ms and DKM [68] runs 148 ms. Our model runs similarly compared to the baselines. The running time comparison to other dense methods is in Tab. 3.1. Benefit of the paired MIM pretraining Shown in Tab. 7.5, with the paired MIM pretext task, the pose accuracy thresholded at 5◦ improves by 3.5% = 61.4% − 57.9%. A visual result of the paired MIM task is shown in Fig. 3.4. CFGM and Homography Loss The benefit of the proposed CFGM module and homography loss 𝐿 ℎ is included in Tab. 7.5. They help the network predict more accurate results in textureless planar surfaces. 44 3.6 Conclusion This work investigates the benefit of pretraining the encoder and decoder of a dense geometric matching network under the paired MIM task. We solve the discrepancy between the pretraining and finetuning tasks. Also, we contribute an improved geometric matching network by reducing the ambiguity of textureless patches and augmenting the learning of local planar surfaces. Limitation Our method does not produce robust local descriptors. When registering a keypoint, our method needs to run dense matching over all past frames, imposing latency for time-sensitive applications, e.g., odometry estimation. 45 CHAPTER 4 TAME A WILD CAMERA: IN-THE-WILD MONOCULAR CAMERA CALIBRATION 3D sensing for monocular in-the-wild images, e.g., depth estimation and 3D object detection, has become increasingly important. However, the unknown intrinsic parameter hinders their development and deployment. Previous methods for the monocular camera calibration rely on specific 3D objects or strong geometry prior, such as using a checkerboard or imposing a Manhattan World assumption. This work solves the problem from the other perspective by exploiting the monocular 3D prior. Our method is assumption-free and calibrates the complete 4 Degree-of- Freedom (DoF) intrinsic parameters. First, we demonstrate intrinsic is solved from two well-studied monocular priors, i.e., monocular depthmap, and surface normal map. However, this solution imposes a low-bias and low-variance requirement for depth estimation. Alternatively, we introduce a novel monocular 3D prior, the incidence field, defined as the incidence rays between points in 3D space and pixels in the 2D imaging plane. The incidence field is a pixel-wise parametrization of the intrinsic invariant to image cropping and resizing. With the estimated incidence field, a robust RANSAC algorithm recovers intrinsic. We demonstrate the effectiveness of our method by showing superior performance on synthetic and zero-shot testing datasets. 
Beyond calibration, we demonstrate downstream applications in image manipulation detection & restoration, uncalibrated two-view pose estimation, and 3D sensing. 4.1 Introduction Camera calibration is typically the first step in numerous vision and robotics applications [99, 164] that involve 3D sensing. Classic methods enable accurate camera calibration by imaging a specific 3D structure such as a checkerboard [192]. With the rapid growth of monocular 3D vision, there is an increasing focus on 3D sensing for in-the-wild images, such as monocular depth estimation, 3D object detection, and 3D reconstruction. While techniques of 3D sensing over in-the-wild monocular images developed, camera calibration for such in-the-wild images continues to pose significant challenges. Classic methods for monocular calibration use strong geometry prior, such as using a checker- 46 Figure 4.1 In (a), our work focuses on monocular camera calibration for in-the-wild images. We recover the intrinsic from monocular 3D-prior. In (c) - (e), an estimated depthmap is converted to surface normal using a groundtruth and noisy intrinsic individually. Noisy intrinsic distorts the point cloud, consequently leading to inaccurate surface normal. In (e), the normal presents a different color to (d). Motivated by the observation, we develop a solver that utilizes the consistency between the two to recover the intrinsic. However, the solution exhibits numerical instability. We then propose to learn the incidence field as an alternative 3D monocular prior. The incidence field is the collection of the pixel-wise incidence ray, which originates from a 3D point, targets at a 2D pixel, and crosses the camera origin, as shown in (b). Similar to depthmap and normal, a noisy intrinsic leads to a noisy incidence field, as in (e). By same motivation, we develop neural network to learn in-the-wild incidence field and develop a RANSAC algorithm to recover intrinsic from the estimated incidence field. board. However, such 3D structures are not always available in in-the-wild images. As a solution, alternative methods relax the assumptions. For example, [111] and [91] calibrate using common objects such as human faces and objects’ 3D bounding boxes. Another significant line of re- search [133, 208, 61, 10, 281, 293, 136] is based on the Manhattan World assumption [49], which posits that all planes within a scene are either parallel or perpendicular to each other. This assump- tion is further relaxed [284, 109, 121] to estimate the lines that are either parallel or perpendicular to the direction of gravity. The intrinsic parameters are recovered by determining the intersected vanishing points of detected lines, assuming a central focal point and an identical focal length. While the assumptions are relaxed, they may still not hold true for in-the-wild images. This creates a contradiction: although we enable robust models to estimate in-the-wild monocular depthmap, generating its 3D point cloud remains infeasible due to the missing intrinsic. A similar challenge arises in monocular 3D object detection, as we face limitations in projecting the detected 3D bounding boxes onto the 2D image. In AR/VR applications, the absence of intrinsic precludes 47 placing multiple reconstructed 3D objects within a canonical 3D space. The absence of a reliable, assumption-free monocular intrinsic calibrator has become a bottleneck in deploying these 3D sensing applications. 
Our method is motivated by the consistency between the monocular depthmap and surface normal map. In Fig. 7.1 (c) - (e), an incorrect intrinsic distorts the back-projected 3D point cloud from the depthmap, resulting in distorted surface normals. Based on this, intrinsic is optimal when the estimated monocular depthmap aligns consistently with the surface normal. We present a solution to recover the complete 4 DoF intrinsic by leveraging the consistency between the surface normal and depthmap. However, the algorithm is numerically ill-conditioned as its computation depends on the accurate gradient of depthmap. This requires depthmap estimation with low bias and variance. To resolve it, we propose an alternative approach by introducing an additional novel 3D monoc- ular prior in complementation to the depthmap and surface normal map. We refer to this as the incidence field, which depicts the incidence ray between the observed 3D point and the projected 2D pixel on the imaging plane, as shown in Fig. 7.1 (b). The combination of the incidence field and the monocular depthmap describes a 3D point cloud. Compared to the original solution, the incidence field is a direct pixel-wise parameterization of the camera intrinsic. This implies that a minimal solver based on the incidence field only needs to have low bias. We then utilize a deep neural network to perform the incidence field estimation. A non-learning RANSAC algorithm is developed to recover the intrinsic parameters from the estimated incidence field. We consider the incidence field a monocular 3D prior. Similar to depthmap and surface normal, the incidence field is invariant to the image cropping or resizing. This encourages its generalization over in-the-wild images. To empirically support our argument, we collected multiple public datasets into a comprehensive dataset with diverse indoor, outdoor, and object-centric images captured by different imaging devices. We further boost the variety of intrinsic by resizing and cropping the images in a similar manner as [91]. Finally, we include zero-shot testing samples to benchmark real-world monocular camera calibration performance. 48 DoF Assumption Train Data [317, 318, 138, 316, 313, 111, 91] 4 Specific-Objects - [133, 208, 281, 136] 1 Manhattan - [109, 138] 1 [121] 3 Manhattan-Train Manhattan-Train Panorama Image Panorama Image Calibrated Image Ours 4 None Table 4.1 Camera Calibration Methods from Strong to Relaxed Assumptions. Non-learning methods [317, 318, 138, 316, 313, 111, 91, 133, 208, 281, 136] rely on strong assumptions. Learning based methods [109, 138, 121] relax the assumptions to training data. Our method makes no assumptions in either training or testing. This enables training with any calibrated images while [109, 138] consume panorama images. Despite that, we calibrate complete 4 DoF intrinsic. We showcase downstream applications that benefit from monocular camera calibration. Despite the aforementioned 3D sensing tasks, we present two intriguing additional applications. One is detecting and restoring image resizing and cropping. When an image is cropped or resized, it disrupts the assumption of a central focal point and identical focal length, leading to irregular intrinsic. Using the estimated intrinsic parameters, we restore the edited image by adjusting its intrinsic to a regularized form. The other application involves two-view uncalibrated camera pose estimation. With established image correspondence, a fundamental matrix [100] is determined. 
However, there does not exist an injective mapping between the fundamental matrix and camera pose [99]. This raises a counter-intuitive fact: inferring the pose from two uncalibrated images is infeasible. But our method enables uncalibrated two-view pose estimation via applying monocular camera calibration. We summarize our contributions as follows: ⋄ Our approach tackles monocular camera calibration from a novel perspective by relying on monocular 3D priors. Our method makes no assumption for the to-be-calibrated image. ⋄ Our algorithm provides robust monocular intrinsic estimation for in-the-wild images, accompanied by extensive benchmarking and comparisons against other baselines. ⋄ We demonstrate its benefits on additional intriguing diverse and novel downstream applications. 4.2 Related Works Monocular Camera Calibration with Geometry. One line of work [133, 208, 61, 10, 281, 293, 136] assumes the Manhattan World assumption [49], where all planes in 3D space are either parallel or perpendicular. Under the assumption, line segments in the image converge at the 49 vanishing points, from which the intrinsic is recovered. LSD [262] and EDLine [3] develop robust line estimators. Others jointly estimate the horizon line and the vanishing points [311, 224, 141]. In Tab. 4.1, recent learning-based methods [283, 284, 109, 138, 121] relax the assumption to training data. They train the model using panorama images whose vanishing point and horizon lines are known. Still, the assumption constrains [283, 284, 109, 138] in modeling intrinsic as 1 DoF camera. Recently, [121] relaxes the assumption to 3 DoF via regressing the focal point. In comparison, our method makes no assumption. This enables us to calibrate 4 DoF intrinsic and train with any calibrated images. Monocular Camera Calibration with Object. Zhang’s method [317] based on a checkerboard pattern is widely regarded as the standard for camera calibration. Several works generalize this method to other geometric patterns such as 1D objects [318], line segments [316], and spheres [313]. Recent works [111] and [91] extend camera calibration to real-world objects such as human faces. Optimizers, including BPnP [37] and PnP [233] are developed. However, the usage of specific objects restricts their applications. In contrast, our approach applies to any image. Image Cropping and Resizing. Detecting content-based image manipulation [155] is exten- sively researched. But few studies geometric manipulation, such as resizing and cropping. On resizing, [56] regresses the image aspect ratio with a deep model. On cropping, a recent proactive method [306] is developed. We demonstrate image calibration also addresses image geomet- ric manipulation. Our method does not need to encrypt images, complementing content-based manipulation detection. Uncalibrated Two-View Pose Estimation. With the fundamental matrix estimated, the two-view camera pose is determined up to a projective ambiguity if images are uncalibrated. Alternative solutions [235, 125, 169] exist by employing deep networks to regress the pose. However, regression hinders the usage of geometric constraints, which proves crucial in calibrated two-view pose estimation [322, 240, 218]. Other work [101, 76] use more than two uncalibrated images for pose estimation. Our work complements prior studies by enabling a minimum uncalibrated two-view solution. 50 Figure 4.2 We illustrate the framework for the proposed monocular camera calibration algorithm. 
In (a), a deep network maps the input image I to the incidence field V. A RANSAC algorithm recovers intrinsic from V. In (b), we visualize a single iteration of RANSAC. An intrinsic is computed with two incidence vectors randomly sampled at red pixel locations. From Eq. (4.2), an intrinsic determines the incidence vector at a given location. The optimal intrinsic maximizes the consistency with the network prediction (blue and orange). Subfigure (c) details the RANSAC algorithm. Different strategies are applied depending on if a simple camera is assumed. If not assumed, we independently compute ( 𝑓𝑥, 𝑏𝑥) and ( 𝑓𝑦, 𝑏𝑦). If assumed, there is only 1 DoF of intrinsic. We proceed by enumerating the focal length within a predefined range to determine the optimal value. 4.3 Method In this section, we first show how to estimate intrinsic parameters by using monocular 3D priors, such as the surface normal map and monocular depthmap. We then introduce the incidence field as a new monocular 3D prior, which complements the surface normal map and monocular depthmap. We describe the training strategy and the network used to learn the incidence field. After estimating the incidence field, we present a RANSAC algorithm to recover the 4 DoF intrinsic parameters. Lastly, we explore various feasible downstream applications of the proposed algorithms. As this work focuses on studying intrinsic parameters in monocular images captured by modern imaging devices, we ignore the estimation of skew, radial, or tangential distortion. Fig. 6.3 shows algorithm framework. 4.3.1 Intrinsic Calibration from Monocular 3D Priors Our method aims to use generalizable monocular 3D priors without assuming the 3D scene geometry. Hence, we start with monocular depthmap D and surface normal map N. Assume there exists a learnable mapping between the input image I, depthmap D, and normal map N: D, N = D𝜃 (I), where D𝜃 can be a learned network. We denote the intrinsic Ksimple, K, and its 51 inverse K−1 as: Ksimple =           𝑓 0 0 𝑤/2 𝑓 ℎ/2 0 0 1                     𝑓𝑥 0 0 0 𝑏𝑥 𝑓𝑦 𝑏𝑦 0 1           0  1/ 𝑓𝑥          0 0 −𝑏𝑥/ 𝑓𝑥 1/ 𝑓𝑦 −𝑏𝑦/ 𝑓𝑦 0 1           , K = , K−1 = . (4.1) The notation Ksimple suggests a simple camera model with the identical focal length and central focal point assumption. Given a 2D homogeneous pixel location p⊺ = and its depth 𝑥 (cid:105) (cid:104) 𝑦 1 value 𝑑 = D(p), the corresponding 3D point is defined as: 𝑋           where the vector v is an incidence ray, originating from the 3D point P, directed towards the 2D 𝑥−𝑏 𝑥 𝑓𝑥 𝑦−𝑏 𝑦 𝑓𝑦 = 𝑑 · K−1 = 𝑑 · v, P = 𝑑 ·                                                   = 𝑑 · (4.2) 𝑌 1 1 1 𝑥 𝑦 pixel p, and passing through the camera’s origin. The incidence field is determined by the collection of incidence rays associated with each pixel, where v = V(p). 4.3.2 Intrinsic from Monocular 3D Prior Constraints In this section, we explain how to determine the intrinsic matrix K using the estimated surface normal map N and depthmap D. Given the estimated depth 𝑑 = D(p) and normal n = N(p) at 2D pixel location p, a local 3D plane is described as: n⊺ · 𝑑 · v + 𝑐 = 0. By taking derivative in 𝑥-axis and 𝑦-axis directions, we have: n⊺∇𝑥 (𝑑 · v) = 0, n⊺∇𝑦 (𝑑 · v) = 0. (4.3) (4.4) Note the bias 𝑏 of the 3D local plane is independent of the camera projection process. Without loss of generality, we show the case of our method for 𝑥-direction. Expanding Eq. 
(4.4), we obtain:

$$n_1 \nabla_x\!\left(d \cdot \frac{x - b_x}{f_x}\right) + n_2 \, \frac{y - b_y}{f_y} \, \nabla_x(d) + n_3 \, \nabla_x(d) = 0, \qquad (4.5)$$

where $\nabla_x(d)$ represents the gradient of the depthmap D in the $x$-axis and can be computed, for example, using a Sobel filter [129]. Next, re-parametrize the unknowns in Eq. (4.5) to get:

$$a_1 f_x f_y + a_2 f_x b_y + a_3 f_y b_x + a_4 f_y + a_5 f_x = 0. \qquad (4.6)$$

Figure 4.3 In (a) and (b), we highlight the ground-truth depthmap of a smooth surface, such as a table’s side. Even with the ground-truth depthmap, the resulting surface normals exhibit noise patterns due to the inherent high variance. This makes the intrinsic solver based on the consistency of the depthmap and surface normals numerically unstable. Further, (a)-(d) demonstrate a scaling and cropping operation applied to each modality. In (c), the intrinsic changes per operation, leading to ambiguity if a network directly regresses the intrinsic values. Meanwhile, the FoV is undefined after cropping. In comparison, the incidence field remains invariant to image editing, same as the surface normal and depthmap.

Divide both sides of the equation by $f_x$ to get:

$$a_1 f_y + a_2 b_y + a_3 \, r b_x + a_4 \, r + a_5 = 0, \qquad (4.7)$$

where $r = f_y / f_x$. By stacking Eq. (4.7) with $N \geq 4$ randomly sampled pixels, we acquire a linear system:

$$\mathbf{A}_{N \times 4} \, \mathbf{X}_{4 \times 1} = \mathbf{B}_{N \times 1}, \qquad (4.8)$$

where the intrinsic parameters to be solved are stored in a vector $\mathbf{X}_{4 \times 1}^\top = \big[\, \bar{f}_y \;\; \bar{b}_y \;\; \overline{r b_x} \;\; \bar{r} \,\big]^\top$. This solves the intrinsic parameters as:

$$f_y = \bar{f}_y, \qquad b_y = \bar{b}_y, \qquad f_x = \frac{\bar{f}_y}{\bar{r}}, \qquad b_x = \frac{\overline{r b_x}}{\bar{r}}. \qquad (4.9)$$

The known constants are stored in the matrices $\mathbf{A}_{N \times 4}$ and $\mathbf{B}_{N \times 1}$. If we choose $N = 4$ in Eq. (4.8), we obtain a minimal solver where the solution $\mathbf{X}$ is computed by performing Gauss-Jordan elimination. Conversely, when $N > 4$, the linear system is over-determined, and $\mathbf{X}$ is obtained using a least-squares solver. The above suggests the intrinsic is recoverable from the monocular 3D prior.

4.3.3 Incidence Field as Monocular 3D Prior

Eq. (4.9) relies on the consistency between the surface normal and the depthmap gradient, which may require a low-variance depthmap estimate. From Fig. 4.3, even the ground-truth depthmap leads to spurious normals due to its inherent high variance, so the minimal solver in Eq. (4.9) can lead to a poor solution. As a solution, we propose to directly learn the incidence field V as a monocular 3D prior. In Eq. (4.2) and Fig. 7.1, the combination of the incidence field V and the monocular depthmap D creates a 3D point cloud. In Eq. (4.3), the incidence field V can measure the observation angle between a 3D plane and the camera. Similar to the depthmap D and surface normal map N, the incidence field V is invariant to image cropping and resizing. Consider an image cropping and resizing described as:

$$\mathbf{x}' = \Delta\mathbf{K} \, \mathbf{x}, \qquad \Delta\mathbf{K} = \begin{bmatrix} \Delta f_x & 0 & \Delta c_x \\ 0 & \Delta f_y & \Delta c_y \\ 0 & 0 & 1 \end{bmatrix}, \qquad \mathbf{K}' = \Delta\mathbf{K} \, \mathbf{K}, \qquad (4.10)$$

where K′ is the intrinsic after transformation. The surface normal map N and depthmap D after transformation are defined as:

$$\mathbf{N}'(\mathbf{x}') = \mathbf{N}(\mathbf{x}) = \mathbf{N}(\Delta\mathbf{K}^{-1}\mathbf{x}'), \qquad \mathbf{D}'(\mathbf{x}') = \mathbf{D}(\mathbf{x}) = \mathbf{D}(\Delta\mathbf{K}^{-1}\mathbf{x}'). \qquad (4.11)$$

Similarly, the incidence field after transformation is:

$$\mathbf{V}'(\mathbf{x}') = (\mathbf{K}')^{-1}\mathbf{x}' = \mathbf{K}^{-1}(\Delta\mathbf{K})^{-1}\mathbf{x}' = \mathbf{K}^{-1}\mathbf{x} = \mathbf{v} = \mathbf{V}(\mathbf{x}). \qquad (4.12)$$

Eq. (4.12) suggests that the incidence field V is a parameterization of the intrinsic matrix that is invariant to image resizing and cropping. Other invariant parameterizations of the intrinsic matrix, such as the camera field of view (FoV), rely on the central focal point assumption and only cover a 2-DoF intrinsic matrix. An illustration is shown in Fig. 4.3.
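To make Eqs. (4.2) and (4.10)–(4.12) concrete, the NumPy sketch below builds the incidence field from an intrinsic matrix and checks its invariance under a crop-and-resize ΔK; the specific intrinsic and ΔK values are illustrative, and this is a verification sketch rather than the calibration code.

```python
import numpy as np

def incidence_field(K, h, w):
    # v = K^{-1} x for every pixel, Eq. (4.2); last component normalized to 1.
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T  # (3, h*w)
    v = np.linalg.inv(K) @ pix
    return (v / v[2:3]).T.reshape(h, w, 3)

K = np.array([[720.0, 0.0, 320.0],       # illustrative intrinsic of a 640 x 480 image
              [0.0, 720.0, 240.0],
              [0.0, 0.0, 1.0]])
dK = np.array([[0.5, 0.0, -30.0],        # illustrative resize-by-0.5 and crop, Eq. (4.10)
               [0.0, 0.5, -20.0],
               [0.0, 0.0, 1.0]])
V = incidence_field(K, 480, 640)
V_edit = incidence_field(dK @ K, 240, 320)   # K' = dK K

# Eq. (4.12): the incidence ray at x' = dK x equals the ray at x.
x = np.array([100.0, 80.0, 1.0])
xp = dK @ x
assert np.allclose(V[int(x[1]), int(x[0])], V_edit[int(xp[1]), int(xp[0])])
```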
4.3.4 Learn Monocular Incidence Field Given the strong connection between the monocular depthmap D and camera incidence field V, we adopt NewCRFs [309], a neural network used for monocular depth estimation, for incidence 54 field estimation. We change the last output head to output a three-dimensional normalized incidence field (cid:101)V with the same resolution as the input image I. We adopt a cosine similarity loss defined as: (cid:101)V = D𝜃 (I), 𝐿 = 1 𝑁 𝑁 ∑︁ 𝑖=1 (cid:101)V⊺ (x𝑖)(cid:101)Vgt(x𝑖). (4.13) We normalize the last dimension of the incidence field to one before feeding to the RANSAC algorithm. That is to say, V⊺ (x𝑖) = (cid:104) ˜𝑣1/ ˜𝑣3 ˜𝑣2/ ˜𝑣3 1 (cid:105) ⊺ (cid:104) = 𝑣1 𝑣2 1 (cid:105) ⊺ . 4.3.5 Intrinsic from Monocular Incidence Field Since the network inference executes on GPU device, we adopt a GPU-end RANSAC algorithm to recover the intrinsic K from the incidence field V. Unlike a CPU-based RANSAC, we perform fixed 𝐾𝑟 iterations of RANSAC without termination. In RANSAC, we use the minimal solver to generate 𝐾𝑐 candidates and select the optimal one that maximizes a scoring function (see Fig. 6.3). RANSAC w.o Assumption. From Eq. (4.2), the incidence vector v relates to the intrinsic K as: v⊺ = K−1x = (cid:104) 𝑥−𝑏 𝑥 𝑓𝑥 𝑦−𝑏 𝑦 𝑓𝑦 (cid:105) ⊺ . 1 (4.14) From Eq. (4.14), a minimal solver for intrinsic is straightforward. In the incidence field, randomly sample two incidence vectors (v1)⊺ = (cid:104) 𝑣1 𝑥 (cid:105) ⊺ 𝑣1 𝑦 1 and (v2)⊺ = (cid:104) 𝑣2 𝑥 (cid:105) ⊺ 𝑣2 𝑦 1 . The intrinsic is: 𝑓𝑥 = 𝑥1−𝑥2 𝑥−𝑣2 𝑣1 𝑥 𝑏𝑥 = 1 2 (𝑥1 − 𝑣1 𝑥 𝑓𝑥 + 𝑥2 − 𝑣2 𝑥 𝑓𝑥)    ,    𝑓𝑦 = 𝑦1−𝑦2 𝑦−𝑣2 𝑣1 𝑦 . (4.15) 𝑏𝑦 = 1 2 (𝑥1 − 𝑣1 𝑦 𝑓𝑦 + 𝑥2 − 𝑣2 𝑦 𝑓𝑦) Similarly, the scoring function is defined in 𝑥-axis and 𝑦-axis, respectively: 𝜌𝑥 ( 𝑓𝑥, 𝑏𝑥, {x}, {v}) = 𝑁𝑘∑︁ (cid:18) 𝑖=1 𝑥𝑖 − 𝑏𝑥 𝑓𝑥 ∥ − 𝑣𝑖 𝑥 ∥ < 𝑘𝑥 (cid:19) , 𝜌𝑦 ( 𝑓𝑦, 𝑏𝑦, {x}, {v}) = 𝑁𝑘∑︁ (cid:18) 𝑖=1 𝑦𝑖 − 𝑏𝑦 𝑓𝑦 ∥ − 𝑣𝑖 𝑦 ∥ < 𝑘 𝑦 (cid:19) . (4.16) RANSAC w/ Assumption. If a simple camera model is assumed, i.e., intrinsic only has an unknown focal length, it only needs to estimate 1-DoF intrinsic. We enumerate the focal length candidates as: { 𝑓 } = { 𝑓min + 𝑖 𝑁 𝑓 ( 𝑓max − 𝑓min) | 0 ≤ 𝑖 ≤ 𝑁 𝑓 }. (4.17) 55 Perspective [121] 𝑒 𝑓 Dataset Calibration Scene ZS Syn. ✗ NuScenes [33] Calibrated Driving ✗ KITTI [83] Calibrated Driving ✗ Cityscapes [48] Calibrated Driving ✗ Calibrated NYUv2 [222] Indoor ✗ ARKitScenes [13] Calibrated Indoor ✗ Calibrated SUN3D [289] Indoor ✗ SfM MVImgNet [307] Object ✗ Label Objectron [2] Object ✗ SfM MegaDepth [146] Outdoor Driving ✔ Calibrated Waymo [235] ✔ Pre-defined RGBD [232] Indoor ✔ Calibrated ScanNet [52] Indoor Hybrid ✔ Pre-defined MVS [80] Pre-defined Synthetic ✔ Scenes11 [36] ✔ 0.610 ✔ 0.670 ✔ 0.713 ✔ 0.449 ✔ 0.362 ✔ 0.442 ✔ 0.204 ✔ 0.178 ✗ 0.493 ✗ 0.564 ✗ 0.264 ✗ 0.385 ✗ 0.312 ✗ 0.348 𝑒𝑏 0.248 0.221 0.334 0.409 0.410 0.501 0.500 0.339 0.000 0.020 0.000 0.010 0.000 0.000 Ours 𝑒𝑏 𝑒 𝑓 0.102 0.087 0.111 0.078 0.108 0.110 0.086 0.174 0.140 0.243 0.113 0.205 0.101 0.081 0.078 0.070 0.137 0.046 0.210 0.053 0.097 0.039 0.128 0.041 0.170 0.028 0.170 0.044 Ours + Asm. 𝑒𝑏 𝑒 𝑓 0.402 0.400 0.383 0.368 0.387 0.367 0.376 0.379 0.400 0.377 0.389 0.383 0.108 0.072 0.088 0.079 0.109 0.000 0.157 0.020 0.067 0.000 0.109 0.010 0.127 0.000 0.117 0.000 Table 4.2 In-the-Wild Monocular Camera Calibration. We benchmark in-the-wild monocular camera calibration performance. On the training dataset except MegaDepth, we synthesize novel intrinsic by cropping and resizing. 
Note the synthesized images violate the focal point and focal length assumption. [Key: ZS = Zero-Shot, Asm. = Assumptions, Syn. = Synthesized] The scoring function under the scenario is defined as the summation over 𝑥-axis and 𝑦-axis: 𝜌( 𝑓 , {x}, {v}) = 𝜌𝑥 ( 𝑓𝑥, 𝑤/2, {x}, {v}) + 𝜌𝑦 ( 𝑓𝑦, ℎ/2, {x}, {v}). (4.18) 4.3.6 Downstream Applications Image Crop & Resize Detection and Restoration. Eq. (4.10) defines a crop and resize operation: x′ = ΔK x, K′ = ΔK K, I′(x′) = I′(ΔK x) = I(x). (4.19) When a modified image I′ is presented, our algorithm calibrates its intrinsic K′ and then: Case 1: The original intrinsic K is known. E.g., determine K with the camera type through the image-associated EXIF file [4]. Image manipulation is computed as ΔK = K′ K−1. A manipulation is detected if ΔK deviates from an identity matrix. The original image restores as I(x) = I′(ΔK x). Interestingly, the four corners of image I′ are mapped to a bounding box in original image I under manipulation ΔK. We thus quantify the restoration by measuring the bounding box. See Fig. 6.7. Case 2: The original intrinsic K is unknown. We assume the genuine image possess an identical focal length and central focal point. Any resizing and cropping are detected when matrix K′ breaks this assumption. Note, the rule can not detect aspect ratio preserving resize and centered crop. We restore the original image by defining an inverse operation ΔK restore K′ to an intrinsic fits the assumption. 56 FoV (◦) Upright [136] Mean Median PAMI’13 9.47 4.42 erceptual [109] CTRL-C [138] Perspective [121] ICCV’21 3.59 2.72 CVPR’23 3.07 2.33 CVPR’18 4.37 3.58 Ours w/o Asm. w/ Asm. 2.49 1.96 2.47 1.92 Table 4.3 Comparisons to Monocular Camera Calibration with Geometry on GSV dataset [5]. We follow the training and testing protocol of [138]. For a fair comparison, we convert the estimated intrinsic to camera FoV on the 𝑦-axis direction, following [138, 121], and report our results w/ and w/o the assumptions. 3D Sensing Related Tasks. With intrinsic estimated, multiple applications become available for in-the-wild images. E.g., depthmap to point cloud, uncalibrated two-view pose estimation, and etc. 4.4 Experiments 4.4.1 Monocular Camera Calibration In-The-Wild Datasets. Our method is trained whenever a calibrated intrinsic is provided, making it applicable to a wide range of publicly available datasets. In Tab. 4.2, we incorporate datasets of different application scenarios, including indoor, outdoor scenes, driving, and object-centric scenes. Dataset MVS [80] is a hybrid dataset involved with indoor, outdoor, and object-centric images. Many of the datasets utilize only a single type of camera for data collection, resulting in a scarcity of intrinsic variations. Similar to [138], we employ random resizing and cropping to synthesize more intrinsic, marked in Tab. 4.2 column “Syn.”. In augmentation, we first resize all images to a resolution of 480 × 640. We then uniformly random resize up to two times its size and subsequently crop to a resolution of 480 × 640. As MegaDepth [146] collects images captured by various cameras from the Internet, we disable its augmentation. We document the intrinsic parameters of each dataset in Supp. In Tab. 4.2 column “Calibration", we assess intrinsic quality into various levels. “Calibrated” suggests accurate calibration with a checkerboard. “Pre-defined” is less accurate, indicating the default intrinsic provided by the camera manufacturer without a calibration process. 
“SfM" signifies that the intrinsic is computed via an SfM method [209]. “Labeled” means the intrinsic manually labeled by a human. In-The-Wild Monocular Camera Calibration. We benchmark in-the-wild monocular calibration performance on Tab. 4.2. For trained datasets, except for MegaDepth, we test on synthetic data 57 Methods Louraki [158] Fetzer [76] WACV’20 BPnP [37] CVPR’19 FaceCalib [111] FG’23 Ours Ours + Assumptions 𝑒 𝑓 0.662 0.845 0.675 0.133 0.034 0.019 BIWIRGBD-ID [188] 𝑒𝑏 - - - - 𝑒 𝑓𝑥 𝑒 𝑓𝑦 0.662 0.662 0.845 0.845 0.675 0.675 0.133 0.133 0.029 0.016 0.020 0.019 0.000 0.019 𝑒𝑏 𝑥 𝑒𝑏 𝑦 0.387 0.222 0.005 0.001 0.322 0.479 0.042 0.026 0.011 0.018 0.000 0.000 CAD-120 [236] 𝑒 𝑓𝑥 𝑒 𝑓 𝑒 𝑓𝑦 𝑒𝑏 0.732 0.732 0.732 - 0.679 0.679 0.679 - 1.178 1.178 1.178 - 0.151 0.151 0.151 - 0.137 0.042 0.137 0.054 0.047 0.047 0.047 0.000 𝑒𝑏 𝑥 𝑒𝑏 𝑦 0.255 0.180 0.001 0.005 0.103 0.129 0.023 0.063 0.008 0.042 0.000 0.000 Table 4.4 Comparisons to Monocular Camera Calibration with Object. We compare to the recent FaceCalib [111], which calibrates the camera using video containing human faces. We report our results w/ and w/o assuming a simple camera model. We perform zero-shot prediction without training using Tab. 4.2 model. using random cropping and resizing. For the unseen test dataset, we refrain from applying any augmentation to better mimic real-world application scenarios. We compare to the recent baseline [121], which regresses intrinsic via a deep network. Note, [121] can not train on arbitrary calibrated images as requiring panorama images in training. A fair comparison using the same training and testing images is in Tab. 4.4 and Sec. 4.4.2. [121] provides models with two variations: one assumes a central focal point, and another does not. We report with the former model whenever the input image fits the assumption. From Tab. 4.2, our method demonstrates superior generalization across multiple unseen datasets. Further, the result w/ assumption outperforms w/o assumption whenever the input images fit the assumption. Tested on an RTX-2080 Ti GPU, the combined network inference and calibration algorithm runs on average in 87 ms. 4.4.2 Monocular Camera Calibration with Geometry Methods in this line of research hold a Manhattan World assumption, positing that images consist of planes that are either parallel or perpendicular to each other. Stated in Tab. 4.1, baselines [109, 138, 121] relax the assumption to training data. Our method imposes no assumption in both training and testing. This brings three benefits. First, the assumption restricts their training to panorama images. In contrast, our model is trainable with any calibrated images. This yields improved generalization, as shown in Tab. 4.2. Second, it constrains the baselines to a simple camera parameterized by FoV. We consider the proposed incidence field a more generalizable and invariant parameterization 58 Input & Inci- (a) dence (b) Est. Restoration (c) GT. Restoration (d) Original Image Figure 4.4 Image Crop & Resize Detection and Restoration. Image editing, including cropping and resizing changes intrinsic. As in Sec. 4.3.6, monocular calibration is applicable to detect and restore image manipulations. We visualize the zero-shot samples on ScanNet and Waymo. More examples are in Supp. of intrinsic. E.g., while FoV remains invariant to image resizing, it still changes after cropping. However, the incidence field is unaffected in both cases. In Tab. 
4.4, the substantial improvement we achieved (0.60 = 3.07 − 2.47) over the recent SoTA [121] empirically supports our argument. Third, our method calibrates the 4 DoF intrinsic with a non-learning RANSAC algorithm. Baselines instead regress the intrinsic. This renders our method more robust and interpretable. In Fig. 6.3 (b), the estimated intrinsic quality is visually discerned through the consistency achieved between the two incidence fields. 4.4.3 Monocular Camera Calibration with Objects We compare to the recent object-based camera calibration method FaceCalib [111]. The baseline employs a face alignment model to calibrate the intrinsic over a video. Both [111] and ours perform zero-shot prediction. We report performance using Tab. 4.2 model. Compared to [111], our method is more general as it does not assume a human face present in the image. Meanwhile, [111] calibrates over a video, while ours is a monocular method. For a fair comparison, we report the video-based results as an averaged error over the videos. We report results w/ and w/o assuming a simple camera model. Since the tested image has a central focal point, when the assumption applied, the error of the focal point diminished. In Tab. 4.4, we outperform SoTA substantially. The error metrics are in Supp. 59 Methods Baseline Ours KITTI [83] NYUv2 [222] ARKitScenes [13] Waymo [235] RGBD [232] ScanNet [52] MVS [80] mIOU Acc mIOU Acc mIOU 0.586 0.621 0.686 0.691 0.779 0.842 0.710 0.856 0.795 0.852 Acc 0.519 0.837 mIOU Acc mIOU Acc mIOU Acc mIOU Acc 0.667 0.581 0.795 0.681 0.721 0.796 0.811 0.887 0.595 0.638 0.636 0.693 0.597 0.709 0.681 0.781 Table 4.5 Image Crop and Resize Restoration. Stated in Sec. 4.3.6, our method also encompasses the restoration of image manipulations. Use model reported in Tab. 4.2, we conduct evaluations on both seen and unseen datasets. 4.4.4 Downstream Applications Image Crop & Resize Detection and Restoration. Content-based image manipulation detection and restoration [16, 154] is extensively studied. However, few explore geometric manipulation, including resizing and cropping. In Sec. 4.3.6, our method also addresses the detection and restoration of geometric manipulations in images. Using the model reported in Tab. 4.2, we benchmark its performance in Tab. 4.5. Random manipulations following Sec. 4.4.1 contribute to 50% of both train and test sets, and the other 50% are genuine images. In Tab. 4.5, we evaluate restoration with mIOU and report detection accuracy (i.e. binary classification of genuine vs edited images). From the table, our method generalizes to the unseen dataset, achieving an averaged mIOU of 0.680. Meanwhile, we substantially outperform the baseline, which directly regresses the intrinsic. The ablation suggests the benefit of the incidence field as an invariant intrinsic parameterization. Beyond performance, our algorithm is interpretable. In Fig. 6.7, the perceived image geometry is interpretable for humans. Uncalibrated Two-View Camera Pose Estimation. With correspondence between two images, one can infer the fundamental matrix [100]. However, the pose between two uncalibrated images is determined by a projective ambiguity. Our method eliminates the ambiguity with monocular camera calibration. In Tab. 4.6, we benchmark the uncalibrated two-view pose estimation and compare it to recent baselines. The result is reported using Tab. 4.2 model by assuming unique intrinsic for both images. We perform zero-shot in ScanNet. 
For MegaDepth, it includes images collected over the Internet with diverse intrinsics. Interestingly, in ScanNet, our uncalibrated method outperforms a calibrated one [152]. In Supp, we plot the curve between pose performance and intrinsic quality. The challenging setting suggests itself an ideal task to evaluate the intrinsic quality. 60 Methods Calibrated ScanNet [52] MegaDepth [146] SuperGlue [204] CVPR’19 DRC-Net [152] ICASSP’22 LoFTR [234] CVPR’21 ASpanFormer [38] ECCV’22 PMatch [331] CVPR’23 PMatch [331] CVPR’23 ✔ ✔ ✔ ✔ ✔ ✗ @5◦ @10◦ @20◦ @5◦ @10◦ @20◦ 75.9 16.2 58.3 7.7 81.2 22.0 83.1 25.6 85.7 29.4 47.4 11.4 42.2 27.0 52.8 55.3 61.4 16.8 51.8 30.5 57.6 63.3 67.4 49.4 61.2 42.9 69.2 71.5 75.7 30.6 33.8 17.9 40.8 46.0 50.1 29.8 Table 4.6 Uncalibrated Two-View Camera Pose Estimation. We use the model reported in Tab. 4.2 and assume distinct camera models for both frames. During calibration, we apply the simple camera assumption. The last two rows ablate the performance using GT intrinsic and our estimated intrinsic. 4.5 Conclusion We calibrate monocular images through a novel monocular 3D prior referred as incidence field. The incidence field is a pixel-wise parameterization of intrinsic invariant to image resizing and cropping. A RANSAC algorithm is developed to recover intrinsic from the incidence field. We extensively benchmark our algorithm and demonstrate robust in-the-wild performance. Beyond calibration, we show multiple downstream applications that benefit from our method. Limitation. In real application, whether to apply the assumption still waits human input. Broader Impacts. We do not anticipate any potential negative social impact arising from this work. 61 CHAPTER 5 LIGHTEDDEPTH: VIDEO DEPTH ESTIMATION IN LIGHT OF LIMITED INFERENCE VIEW ANGLES Video depth estimation infers the dense scene depth from immediate neighboring video frames. While recent works consider it a simplified structure-from-motion (SfM) problem, it still differs from the SfM in that significantly fewer view angels are available in inference. This setting, however, suits the mono-depth and optical flow estimation. This observation motivates us to decouple the video depth estimation into two components, a normalized pose estimation over a flowmap and a logged residual depth estimation over a mono-depth map. The two parts are unified with an efficient off-the-shelf scale alignment algorithm. Additionally, we stabilize the indoor two- view pose estimation by including additional projection constraints and ensuring sufficient camera translation. Though a two-view algorithm, we validate the benefit of the decoupling with the substantial performance improvement over multi-view iterative prior works on indoor and outdoor datasets. 5.1 Introduction Depth estimation is a fundamental task for applications such as 3D reconstruction [31], robotics [132], and autonomous driving [310]. The depth is self-contained in the scene mo- tion brought by the camera movement. The classic SfM methods [157, 209, 196, 287, 85] hence jointly recover the scene depth and camera poses by applying bundle-adjustment over the entire video sequence. However, the iterative optimization defined over all frames makes SfM a compu- tationally intensive method. Video depth estimation simplifies the computation by only consuming the immediate neighboring frames. In consequence, only limited camera view angles are available, as shown in Fig. 7.1 (a). The limited camera views, however, suit optical flow and monocular depth estimation. 
We are then motivated to connect video depth to mono-depth and flow estimation by decoupling the video-depth into two components. First, we use the flowmap to estimate a normalized up-to-scale camera pose, i.e., camera pose with a unit-length translation vector. Second, we estimate video 62 depth as a logged residual over the mono-depthmap. The two components are unified by an efficient off-the-shelf camera scale alignment algorithm, aligning the depthmap and flowmap, making the residual depth estimation a stereo matching. Unlike our method, most prior video depth estimation works [280, 240, 258, 268, 291] formulate their solutions as deep SfM, shown in Fig. 7.1 (b). They can be grouped into two types [268]. Type I methods [280, 240, 258] execute SfM within a fixed frame window, embedding bundle-adjustment as a differentiable module within a network. Type II methods [268, 291] execute a consecutive- frame SfM. They sequentially estimate an up-to-scale pose and an up-to-scale depthmap. While prior works solve video depth estimation as a simplified SfM problem, our method differs in decoupling the video depth estimation to two sub-tasks which are robust to deficient camera views, i.e., flow based normalized pose estimation and logged residual depth estimation. On pose estimation, we compare the optical flow with the projection flow computed from the pose and depthmap, using the State-of-The-Art (SoTA) methods of each side, i.e., DeepV2D [240] and RAFT [241]. The results in Supp.Tab. 1 show that the optical flow is more robust than the projection flow. Since the flow performance is a bottleneck for pose performance, this suggests, instead of optimizing poses by bundle-adjustment together with the depthmap as the type I method, directly estimating the pose from flowmap can be more accurate, as the noise inside the depthmap is avoided. We follow [322] in using the five-point algorithm [143] with RANSAC [77] to estimate the normalized pose. On video depth estimation, we treat it as a log space residual estimation over the monocular initialization. While prior works [280, 240, 258] already adopt mono-depthmap as initialization, the connection between monocular and video depth is under-explored. Prior works simply repeat the video depth estimation after updating the pose. Specifically, they estimate the video depth by a 3D cost volume constructed by sampling the next frame feature map at different projected locations specified by pre-defined depth candidates. Instead, we change the sampling from fixed candidates to fixed log space residual candidates. This brings three benefits: (1) It enables the video depth to benefit from SoTA monocular depth. (2) It improves the sampling efficiency in constructing the 63 BTS[137] DeepMLE[58] SfMR[268] DRO[94] DeepV2D[240] DeepV2cD[110] MaGNet[6] Ours Figure 5.1 Video Depth Performance Comparison on KITTI Dataset. We mark the methods taking different numbers of frames with different colors. We propose a two-view video depth estimation method that substantially outperforms prior two-view, three-view, and five-view methods. Our method uses a monocular depth as initialization. The arrow marks our improvement when using the BTS [137] as the initialization. Comparison is detailed in Tab. 5.1. cost volume, as candidates are drawn dynamically, centering around the initial guess rather than fixed. (3) It provides a reliable lower-bound depth performance for moving foreground objects and static frames. 
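To make the log-space residual parameterization concrete, the sketch below (hypothetical function names; the candidate count and interval are placeholders rather than the values used in our experiments) generates depth hypotheses as fixed offsets in log space around the monocular initialization and composes the final video depth by exponentiating the predicted residual.

```python
import numpy as np

def residual_depth_candidates(mono_depth, num_candidates=32, interval=0.05):
    """Depth hypotheses as fixed log-space residuals around the monocular initialization D*.

    mono_depth: (H, W) initial monocular depthmap.
    Returns:    (num_candidates, H, W) candidates D_i = exp(delta_i) * D*.
    """
    deltas = (np.arange(num_candidates) - num_candidates // 2) * interval
    return np.exp(deltas)[:, None, None] * mono_depth[None]

def compose_video_depth(mono_depth, log_residual):
    """Final video depth D = D* . exp(residual), where the residual is predicted in log space."""
    return mono_depth * np.exp(log_residual)

# Toy example: candidates concentrate around D*, and a zero residual falls back to the
# monocular initialization, which is what bounds the error for moving objects or static frames.
D_star = np.full((4, 6), 8.0)                                   # 4x6 mono-depth, in meters
candidates = residual_depth_candidates(D_star)                  # shape (32, 4, 6)
D_video = compose_video_depth(D_star, np.zeros_like(D_star))    # equals D_star
```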
The residual video depth estimation is stereo matching via an estimated pose. Yet, we only estimate the normalized pose, still lacking the baseline. We then propose an efficient voting based scale alignment algorithm, estimating the camera scale by aligning the monocular depthmap with flowmap. This algorithm connects the two decoupled sub-tasks: the normalized pose and residual depth estimation. Empirically, we find that the five-point algorithm runs less accurately in indoor scenarios. This is because indoor videos are taken by hand-held cameras, possessing much more rotation movement than outdoor videos taken by car-mounted cameras. The additional rotation movement weakens the epipolar constraint, which is required by the five-point algorithm. To tackle the issue, during each RANSAC consensus checking, we perform the scale alignment algorithm, turning normalized camera pose to metric space pose. Then, we include an additional projection constraint to the original epipolar constraint. It improves both indoor depth and pose performance. We estimate the camera scale from the mono-depth instead of video depthmap. Ideally, similar to residual depth learning, we may use an additional cost volume based decoder to learn the residual camera scale. However, we show that under robust pose and flow estimate, the camera scale learning loss can be converted to a relaxed depth learning loss, as the two only differ by 64 (a) Limited view angles of video depth (b) Prior Multi-View (c) Ours Two- View Figure 5.2 (a) Unlike classic SfM, video depth estimation possesses significantly fewer view angles during inference. (b) Prior multi-view video depth estimation works [240, 237, 258] mimic SfM pipeline, focusing on improving deep bundle-adjustment. (c) Considering the SfM alike pipelines are compromised by the limited view angles, we base the video depth estimation on two deficient view robust sub-tasks, i.e., the relative camera pose estimation based on the flowmap, and the logged residual video depth estimation based on the monocular depthmap. The two sub-tasks are connected by a novel and efficient scale alignment algorithm. We skip RGB inputs for simplicity in (c). a constant in log space. This reduces camera scale learning to depth learning. Empirically and theoretically, we show that a single decoder is sufficient for both residual depth and camera scale learning. We summarize the contributions of our work as follows: • We propose a comprehensive two-view video depth estimation method. Unlike a simplified SfM, we decompose into two sub-tasks that are robust to deficient view angles, and connect them via an efficient scale alignment algorithm. • We stabilize the indoor normalized pose estimation with the additional projection constraint. • Theoretically and empirically, we prove the equality between scale and video depth learning. • On KITTI [84] and NYUv2 [178] datasets, our two-view sequential method reduces 56.5% and 34.1% error on the metric 𝛿 < 1.25 of video depth estimation over SoTA multi-view iterative work [240]. 5.2 Prior Works 5.2.1 Pose and Depth from Multi-View System Structure-from-motion (SfM) [157, 209, 196, 287, 85] is the classic approach to recover scene geometry and camera motion from video. After proper initialization, the pose and 3D points 65 Figure 5.3 Our algorithm takes two RGB inputs (I𝑚, I𝑛), the initial mono-depth D∗, and flowmap O as inputs. Our proposed framework consists of 2 key steps: (1) An improved five-point algo- rithm. 
Given flowmap O and mono-depth map D, apply consensus check over randomly initiated normalized pose set P and its corresponding scale set S. (2) Residual video depth estimation with a cost volume network. Between the two steps, we perform key-frame search if under insufficient camera translation, i.e., re-estimate flowmap and pose with the next frame. Scale set S estimation and video depth D† estimation are further detailed in Fig. 5.4 and 5.6. are finetuned by bundle-adjustment over the input point correspondences. Visual simultaneously localization and mapping (vSLAM) methods [230, 244, 74, 73, 174, 179, 242, 287] are similar to SfM but focus on odometry. Video depth estimation is the other multi-view system. It contrasts to SfM as operating on fixed frame windows, providing limited camera views. Recent works [240, 280, 94, 237, 75, 323, 258, 110] solve video depth estimation as an SfM problem. Inspired by classic SfM, they propose different deep bundle-adjustment modules, minimizing a residual term during the network inference. For instance, [280] and [240] separately propose a first-order and second-order deep optimization scheme. [280] applies an exhaustive search over a local region in the pose parameter space. Given the projection flow computed by the current depth and pose, [240] employs a motion module to estimate a residual flow term. The pose is refined via applying a Gauss-Newton update [285]. Surprisingly, compared to estimating residual pose in inference, none of the prior works estimate residual depth. Our work solves the video depth estimation from the other perspective. Instead of emphasizing the improved deep bundle-adjustment module, we decompose the video depth into sub-tasks that are robust to narrow view angles. Our work can benefit other multi-view methods via serving as their two-view initialization module [240, 280]. 5.2.2 Deep Two-View Structure-from-Motion SfMR [268] revisits the classic two-view SfM [57, 130] with deep learning. They first solve a normalized pose from the input flowmap and then estimate a normalized depthmap, i.e., depthmap 66 divided by the camera scale. Our method improves [268] in multiple perspectives. First, we validate that the optical flow is more robust than the projection flow between immediate frames (detailed in Supp.Tab. 1.). This completes the motivation of estimating normalized pose from the flowmap instead of applying deep bundle-adjustment. In comparison, [268] only discusses its improvement over classic SIFT [159] based two-view SfM. Second, we improve indoor pose estimation performance by including the additional projection constraint. Third, the normalized depth in [268] is poorly ranged, varying from zero to infinity, while the proposed logged residual depth is well ranged. As a result, our model with 32 depth candidates outperforms [268] with 128 depth candidates. Fourth, our method does not require groundtruth pose to produce normalized depth. The normalized pose and camera scale are learned from synthetic flow and groundtruth depth labels, avoiding the noise from the IMU or GPS device. 5.2.3 Multi-View-Stereo With the optimized camera poses, video depth estimation is treated as a multi-view-stereo (MVS) problem. Similar to SfM, most MVS methods [266, 55, 298, 299, 265, 156] assume sufficient view variations, estimating without an init mono-depthmap. A concurrent MVS work [6], however, positions itself to infer depth within a limited frame window. 
[6] skips the non-trivial pose estimation and models depth as a Gaussian distribution. The video depth is estimated by selecting the residual that max-a-posteriori. However, unlike us, they do not align depthmap with the camera pose scale, lacking geometric constraint. In return, though [6] uses groundtruth poses and more frames, we still outperform this iterative method, as in Tab. 5.1. 5.3 Proposed Method Our objective is to jointly solve the interdependent pose and depth given two video frames. Take the process of reconstructing image I𝑚 at frame 𝑚 from image I𝑛 at frame 𝑛 under a depthmap D and pose P as I∗ 𝑚 = 𝑔 ( 𝑓 (D, P) , I𝑛), where I∗ 𝑚 is the reconstructed image. 𝑓 (·) produces 2D projection locations in I𝑛, as a function of D, P, and the intrinsic matrix K (skipped in 𝑓 (·) for simplicity). 𝑔(·) applies bilinear sampling to I𝑛 at 2D locations from 𝑓 (D, P). Formally, we aim 67 (a) Pixel-wise scale estimation (b) Camera scale estimation Figure 5.4 We randomly sample 𝑁𝑘 pixels {p} on frame I𝑚, marked in orange. Corresponded frame I𝑛’s pixels {q} are determined by flowmap O. Sampled depth is {𝑑}. We illustrate: (a) Due to the noise, corresponded pixel q does not comply projective geometry, i.e., q resides outside the epipolar line lp. In Eqn. 5.6, we approximate the scale determined by pixel q with two pixels q𝑥 and 𝑟 q𝑦, residing horizontally and vertically on epipolar line lp. (b) One normalized pose P is initiated by five-point algorithm. Next, with Eqn. 5.7, we acquire a pixel-wise scale set s𝑟. After producing the 𝐵-dim histogram of scale set s𝑟, the optimal scale 𝑠𝑟 is determined by majority voting. to compute the depth D† and pose P† by optimizing the photometric constraint: P†, D† = arg min P,D ℎ 𝑝 (𝑔 ( 𝑓 (D, P) , I𝑛) , I𝑚) , (5.1) where ℎ 𝑝 (·) can be defined in forms such as structural similarity index measure (SSIM) [276, 327]. Recent multi-view works [240, 280, 94, 237, 75, 323, 258, 110] focus on improved mechanisms which, in inference time, enforce Eqn. 5.1. Typically, they adopt an iterative and alternative optimization scheme, minimizing Eqn. 5.1 by iteratively solving: P† = arg min ℎ 𝑝 (cid:0)𝑔 (cid:0) 𝑓 (D, P) , I 𝑗 (cid:1) , I𝑖(cid:1) P D† = arg min ℎ 𝑝 (cid:0)𝑔 (cid:0) 𝑓 (D, P) , I 𝑗 (cid:1) , I𝑖(cid:1) . D    (5.2a) (5.2b) For simplicity, Eqn. 5.2 is written with two-view inputs. Interestingly, their optimization is primarily for pose estimation. If an optimal pose P† is given, video depth is estimated through a single forward inference [240, 280, 94, 237, 75, 323, 258, 110]. In comparison, our method runs sequentially. Given the input flow O and mono-depth initialization D∗, we decouple the video 68 depth estimation into two narrow-view robust objectives: †, 𝑠† = arg min P p,𝑠 (cid:16) (cid:16) ℎ𝑒 P, O 𝜆 · ℎ𝑐 (cid:16) 𝑓 (cid:17) (cid:16) + D∗, 𝑝 (cid:17)(cid:17) (cid:16) P, 𝑠 , O (cid:17)(cid:17) D† = arg min ℎ 𝑝 (cid:16) (cid:16) 𝑔 𝑓 (cid:16) D∗, 𝑝 (cid:16) (cid:17)(cid:17) (cid:17) (cid:17) . , I𝑖 , I 𝑗 P, 𝑠 D    (5.3a) (5.3b) Function 𝑝(·) combines normalized pose P with scale 𝑠: 𝑝 (cid:17) (cid:16) P, 𝑠 = (cid:104) (cid:105) R 𝑠 · t . D∗ and O are initial mono-depthmap and flowmap. D† and 𝜆 are the optimized video depthmap and a predefined weighting parameter. Functions ℎ𝑒 (·) and ℎ𝑐 (·) are epipolar and projection consistency constraints detailed in Sec. 5.3.1. The rest of the section presents our sequential pose and video depth estimation. 
We discuss about the equality between scale and depth learning at the end of the section. The overall framework is illustrated in Fig. 6.3. 5.3.1 Pose Estimation We optimize Eqn. 5.3a in camera pose estimation. Given the flowmap O and mono-depthmap D∗, we reformulate the five-point [143] algorithm with RANSAC [77] to include an additional projection consistency constraint. Specifically, for each normalized pose P initiated by the five- point algorithm, a pixel-wise camera scale is determined given the pixel-wise depth and flow pair. The optimal scale is therefore selected by voting, see Fig. 5.4. This enables us to include a projection constraint in addition to the epipolar constraint during the RANSAC consensus checking. Random Normalized Pose Initiates. We denote the 𝑁𝑘 pixels randomly sampled from frame I𝑚, flowmap O and monocular depthmap D∗ as {p}, {o} and {𝑑}. Then frame I𝑛’s corresponded pixels {q} are given as {q𝑘 | q𝑘 = p𝑘 + o𝑘 , 𝑘 ∈ 𝑁𝑘 }, where 𝑁𝑘 is the number of randomly sampled correspondence. For simplicity, we assume the RANSAC algorithm loops to the max iteration number 𝑁𝑟, where 𝑟 indexes each RANSAC loop. Meanwhile, in each loop, a quick chirality check [143] is applied to convert the essential matrix to the normalized pose. As such, we initiate 𝑟 𝑁𝑟 random normalized pose with the five-point algorithm, denoted as the set P = {P | 𝑟 ∈ 𝑁𝑟 }. 69 (a) KITTI Flow (b) NYUv2 Flow Figure 5.5 Outdoor video motion patterns differ from indoor. Marked in yellow arrows, we visualize an indoor and outdoor scene motion. In (a), a translation dominates the scene motion. In (b), a rotation dominates the scene motion. Comparing (a) and (b), as rotation accumulates, the flow becomes irrelevant to scene depth, making image clues less usable for depth. Further, it degenerates the nonlinear projection transformation to the linear affine transformation, undermining the epipolar constraint based five-point algorithm. We thus introduce the additional projection constraint ℎ𝑐 in Eqn. 5.10. Further, we actively seek keyframes until sufficient translation movement is detected. We plot the entire odometry on the corner of (a) and (b). As the color changes from blue to red, more scene motion is from the rotation movement. Pixel-wise scale estimation. Given any normalized pose P = (cid:104) (cid:105) R t , the depth value of each pixel can determine a camera scale. We name the set of camera scales determined by each depth pixel as pixel-wise scale s. Set p = (cid:104) 𝑝𝑥 𝑝𝑦 1 (cid:105) ⊺ and q = (cid:104) 𝑞𝑥 𝑞𝑦 1 (cid:105) ⊺ are the homogeneous pixel coordinates in I𝑚 and I𝑛, connected by flow O at pixel p. Set camera projection as: 𝑑′q = 𝑑′ (cid:104) 𝑞𝑥 𝑞𝑦 1 (cid:105) ⊺ = 𝑑K R K−1 p + 𝑠K t. (5.4) The 𝑑 and 𝑑′ refer to depth at frame I𝑚 and I𝑛. By arranging Eqn. 5.4, we acquire the relationship between depth 𝑑 and scale 𝑠 at horizontal and vertical directions separately as: 𝑥 − 𝑞𝑥 · 𝑧 ⊺ p−m1 (cid:104) 𝑦 − 𝑞𝑦 · 𝑧 ⊺ p−m2 , 𝑑 𝑦 = 𝑠 ⊺ p (cid:105) ⊺ 𝑑𝑥 = 𝑠 𝑞𝑦m3 𝑞𝑥m3 ⊺ p (cid:105) ⊺ (cid:104) . Here m1 m2 m3 = K t. As in Fig. 5.4 (a), optical flow induced pixel q may not reside on the epipolar line lp, making 𝑑𝑥 and 𝑑 𝑦 possess different values. To pursue 𝑥 𝑦 𝑧 = K R K−1, (5.5) a unique mapping between scale 𝑠 and depth 𝑑, we compute the optimal pixel-wise scale 𝑠 by minimizing the 𝐿2 distance between input monocular depth 𝑑 and 𝑑𝑥, 𝑑 𝑦: 𝑠 = arg min 𝑠 (𝑑𝑥 − 𝑑)2 + (𝑑 𝑦 − 𝑑)2 . 
Then the pixel-wise mapping from depth 𝑑 to scale 𝑠 is: log(𝑠) = log(𝑑) + 𝑚, 70 (5.6) (5.7) Figure 5.6 Illustration of video depth estimation. The shared encoder is drawn as one for simplicity in Fig. 6.3. The encoder and decoder of video depth network D are plotted. We dynamically sample the residual depth candidates D in log space centering around the initial depthmap D∗. Then we construct cost volume V𝐷 with predicted normalized pose p† and the aligned scale 𝑠†. Finally, we predict residual depth ΔD in log space through network D. where 𝑚 = − log 1 2 (cid:16) 𝑥−𝑞 𝑥 𝑘 ·𝑧 ⊺ 3 p𝑘−m ⊺ 1 p𝑘 𝑞 𝑥 𝑘 m + 𝑞 𝑦 𝑘 m (cid:17) 𝑦−𝑞 𝑦 𝑘 ·𝑧 ⊺ 3 p𝑘−m ⊺ 2 p𝑘 material. . The proof is detailed in the supplementary Camera Scale Estimation. Next, we determine the unique camera scale 𝑠𝑟 from the pixel-wise 𝑟 scale set s𝑟 under normalized pose P by majority voting, as shown in Fig. 5.4. Specifically, we produce the histogram of the scale set s𝑟 as a 𝐵-dim vector r. For the 𝑏th element of r, its value r[𝑏] is: r[𝑏] = 𝑁𝑘∑︁ 𝑘=1 (cid:18) 𝑏 𝐵 · 𝑠max ≤ 𝑠𝑘 < 𝑏 + 1 𝐵 (cid:19) . · 𝑠max (5.8) Hyper-parameter 𝑠max is the max scale value we record. The optimal scale 𝑠𝑟 under normalized 𝑟 pose P is then: 𝑠𝑟 = 𝑠max 𝑏 + 0.5 𝐵 , 𝑏 = arg max 0≤𝑏<𝐵 r[𝑏]. (5.9) To this step, for the 𝑁𝑟 randomly sampled normalized pose P in RANSAC, we conclude the corresponded 𝑁𝑟 scale estimate, denoted as set S = {𝑠𝑟 | 𝑟 ∈ 𝑁𝑟 }. Consensus Check. As in Fig. 5.5, we introduce an additional projection constraint ℎ𝑐 to stabilize 𝑟 the five-point algorithm in indoor videos. For the 𝑟th randomly sampled normalized pose P , given 𝑟 {p}, {q}, {o} and {𝑑}, the original epipolar constraint ℎ𝑒 (P , {o}) and the additional projection 𝑟 consistency constraint ℎ𝑐 (P , 𝑠𝑟, {p}, {q}, {𝑑}) are: 71 Method DORN [79] BTS [137] AdaBins [18] NeWCRFs [309] Ours + BTS [137] Ours + AdaBins [18] Ours + NeWCRFs [309] BA-Net [237] SfMR [268] DeepMLE [58] DRO [94] MaGNet [6] Venue CVPR’18 Arxiv’18 CVPR’21 CVPR’22 CVPR’23 ICLR’19 CVPR’21 Arxiv’22 Arxiv’21 CVPR’22 DeepV2D [240] ICLR’20 DeepV2cD [110] Ours + MonoDepth2 [88] Ours + BTS [137] Ours + AdaBins [18] Ours + NeWCRFs [309] ICPRAI’22 CVPR’23 Frame Labels Abs Rel 0.069 D 0.059 D 0.058 D 0.052 D 0.037 D+F 0.045 D+F 0.041 D+F 0.083 D+P 0.055 D+F+P 0.060 D+F+P 0.047 D+P 0.051 D 0.064 D+P 0.037 D+P 0.037 D+P 0.032 D+F 0.029 D+F 0.030 D+F 0.028 D+F 1 1 1 1 2 2 2 5 2 2 2 3 2 5 5 2 2 2 2 Sq Rel RMSE RMSE log 0.300 0.245 0.190 0.155 0.110 0.108 0.107 0.025 0.224 0.203 0.199 0.160 0.350 0.174 0.167 0.106 0.098 0.089 0.087 0.112 0.096 0.088 0.079 0.059 0.064 0.059 0.134 0.091 0.089 0.082 0.079 0.120 0.074 0.073 0.057 0.053 0.052 0.049 2.857 2.756 2.360 2.129 1.809 1.817 1.748 3.640 2.273 2.257 2.629 2.077 2.964 2.005 1.984 1.889 1.729 1.655 1.597 𝛿 < 1.25 0.945 0.956 0.964 0.974 0.987 0.987 0.989 - 0.956 0.967 0.970 0.974 0.946 0.977 0.978 0.986 0.989 0.989 0.991 𝛿 < 1.252 0.998 0.993 0.995 0.997 0.998 0.998 0.998 - 0.984 0.995 0.994 0.995 0.982 0.993 0.994 0.998 0.998 0.998 0.998 𝛿 < 1.253 0.996 0.998 0.999 0.999 0.999 0.999 0.999 - 0.993 0.999 0.998 0.999 0.991 0.997 - 0.999 0.999 0.999 0.999 Table 5.1 KITTI Monocular Video Depth Evaluation on Eigen split [71] with Garg crop [82] capped at 80 meters using semi-dense groundtruth [257]. The lower half table applies median scaling [325] to the predicted depths to compare with SfM methods. 
[Key: Best, Second Best except our work, Frame=the number of frames used in inference, Labels=required supervision in training, D=semi-dense depthmap, P=IMU pose, F=synthetic optical flow datasets [166, 32]] 𝑟 ℎ𝑒 (P , {o}) = 𝑁𝑟∑︁ (cid:16) ⊺ q 𝑘 K-⊤EK⊺p𝑘 < 𝑘 𝑒 (cid:17) 𝑟 ℎ𝑐 (P 𝑘=1 , 𝑠𝑟, {p}, {q}, {𝑑}) = 𝑁𝑟∑︁ 𝑟 ∥ 𝑓 (𝑑𝑘 , 𝑝(P (cid:16) , 𝑠𝑟)) − q𝑘 ∥2 < 𝑘𝑐 (5.10a) (5.10b) (cid:17) . 𝑘=1    Here E is an essential matrix, expressed by the matrix form of the cross product [ ]× as 𝑟 E = R[t]×. The final consensus check number is a weighted summation of the two as ℎ(P ) = ℎ𝑒 (·) + 𝜆 · ℎ𝑐 (·). The optimal normalized pose P † and scale 𝑠† is selected with the highest consensus number. The RANSAC stop criteria are updated with the new constraint ℎ(·). Key-frame Search. In Fig. 5.5, scene depth becomes irrelevant with scene motion under an extreme pure rotation movement. Without the loss of generality, more 3D information is revealed from two-view triangulation as the camera translation a.k.a., baseline, increases. For video captured by a moving platform or a service robot, e.g., KITTI dataset, there typically exists sufficient camera 72 Venue CVPR’18 Arxiv’18 CVPR’21 CVPR’22 CVPR’23 Method DORN [79] BTS [137] AdaBins [18] NewCRFs [309] Ours + BTS [137] Ours + AdaBins [18] Ours + NewCRFs [309] DfUSMC [98] DeMoN [258] DeepV2D [240] ICLR’20 Ours + BTS [137] Ours + AdaBins [18] Ours + NewCRFs [309] CVPR’23 1 1 1 1 2 2 2 Frame Abs Rel 0.115 0.108 0.103 0.095 0.102 0.095 0.090 0.447 0.144 0.094 0.061 0.070 0.064 0.057 2 2 9 2 2 2 Sc Inv RMSE 0.509 0.404 0.370 0.334 0.356 0.326 0.306 1.793 0.775 0.521 0.403 0.280 0.255 0.230 - 0.115 0.106 0.090 0.098 0.089 0.080 0.456 0.179 0.133 0.094 0.098 0.089 0.080 log10 - 0.047 0.044 0.041 0.044 0.040 0.038 0.169 0.061 0.403 0.026 0.030 0.027 0.025 𝛿 < 1.25 0.828 0.885 0.903 0.922 0.903 0.923 0.935 0.487 0.805 0.905 0.956 0.948 0.961 0.971 𝛿 < 1.252 0.965 0.978 0.983 0.992 0.984 0.990 0.995 0.697 0.951 0.975 0.989 0.991 0.994 0.996 𝛿 < 1.253 0.992 0.994 0.997 0.998 0.997 0.998 0.999 0.814 0.985 0.992 0.996 0.998 0.999 0.999 CVPR’16 Multi CVPR’17 Table 5.2 NYUv2 Monocular Video Depth Evaluation. Results in the lower half table ap- ply median scaling in evaluation. Results of DeMoN [258] is from [240]. Results of 2-view DeepV2D [240] are evaluated with the published code and pretrained model. [Key: Red color marks Best, Blue color marks the Second Best, Frame marks the number of frames in inference] translation between consecutive frames. However, the camera rotation frequently dominates the movement for the video taken by a hand-held camera, e.g., NYUv2 and ScanNet dataset. We alleviate the issue by actively seeking sufficient camera translation. Automatically, as in Fig. 6.3, we repeat the flow initialization step and pose estimation step with the next frame if the estimated scale 𝑠† < 𝑘 𝑠, where 𝑘 𝑠 is a predefined minimum translation. Scale Update. The camera scale 𝑠† will be updated with the finetuned video depthmap D† using Eqn. 5.8 and Eqn. 5.9 if odometry is desired. 5.3.2 Video Depth Estimation To this end, we have optimized Eqn. 5.3a. To optimize Eqn. 5.3b in inference, we adopt a cost †, 𝑠†) volume based network, taking in an initial monocular depthmap D∗, predicted pose P† = 𝑝(P and a frame pair I𝑚/I𝑛 (see Fig. 6.3). We consider video depth estimation a log space residual learning over its monocular depth initialization D∗. The meaning of residual is two-fold. Construct Cost Volume V𝐷. 
We sample residual depth candidates D of size 𝑘 D around initial monocular depthmap D∗ with predefined interval Δ𝑑 as: D = (cid:8)D𝑖 ∥ D𝑖 = exp(Δ𝑑𝑖) · D∗(cid:9) 𝑘 D 𝑖=1 . (5.11) 73 We then sample feature map F𝑛 according to D and predicted pose P as: 𝑑 = (cid:8)F∗ F ∗ 𝑖 ∥ F∗ 𝑖 = 𝑔( 𝑓 (D𝑖, P), F𝑛)(cid:9) 𝑘 D 𝑖=1 . (5.12) V𝐷 is then constructed by stacking F ∗ 𝑑 and the repetition of input feature F𝑛, illustrated in Fig. 5.6. Estimate Residual Depth. The cost volume is decoded by ResDepth network D, yielding a log space residual depthmap ΔD for monocular initial D∗, preparing the final video depthmap D as: D† = D∗ · exp(𝚫D) = D∗ · exp(D(V𝐷)). (5.13) Supervision Signal. Following [137], we use a scale-invariant loss, to supervise the training of the depth network, 𝐷 (𝑤) = 1 𝑛 𝑛 ∑︁ 𝑖=1 𝑤2 𝑖 − (cid:16)1 𝑛 𝑛 ∑︁ 𝑖=1 (cid:17) 2 𝑤𝑖 + (1− 𝜇) (cid:16) 1 𝑛 (cid:17) 2 𝑤𝑖 , 𝑛 ∑︁ 𝑖=1 (5.14) where 𝑤𝑖 = log 𝑑𝑖 − log ˜𝑑𝑖, 𝑛 is the number of pixels and ˜𝑑𝑖 is groundtruth depth. 5.3.3 Equality of Scale and Video Depth Learning In Fig. 6.3, scale is required before video depth estimation. Though scale can be optimized over an initial mono-depthmap, augmenting it with a network seems a natural choice. In this section, we show the equality of video depth and scale learning and its implication to the choice of scale estimation. Following Eqn. 5.7, we define the optimal scale 𝑠∗ as the average of pixel-wise scale s: log(𝑠∗) = 1 𝑛 𝑛 ∑︁ 𝑖=1 log(𝑠𝑖) = 1 𝑛 𝑛 ∑︁ 𝑖=1 (cid:0) log(𝑑𝑖) + 𝑚𝑖(cid:1). (5.15) We then show that the learning objective for scale 𝑠∗ can be approximated as the learning objective for video depth and a noise term contributed by normalized pose P and optical flow O estimate: 𝐿𝑠∗ = ∥ log( ˜𝑠) − log(𝑠∗) ∥ ≤ 1 𝑛 𝑛 ∑︁ 𝑖=1 ∥ log( ˜𝑑𝑖) −log(𝑑𝑖) ∥ + ∥ 1 𝑛 𝑛 ∑︁ 𝑖=1 ( ˜𝑚𝑖 −𝑚𝑖) ∥. (5.16) Here, ˜𝑠 and ˜𝑑 are groundtruth scale and depth. Estimating scale, by minimizing 𝐿𝑠∗, can be approximately achieved by minimizing its upper-bound in Eqn. 5.16, thus converting to video depth estimation. This indicates that a deep scale estimator learns the same prior knowledge as a video depth estimator. We empirically support our analysis by showing that the framework in Supp Fig. 1 has no benefit in final depth and scale performance, as in Tab. 5.5. 74 09 Seq Err BetterGen∗ [322] LTMVO∗ [334] DfVWild∗ [90] MLF-VO [120] SfMR [268] LSR∗ † [264] 3.10 - 5.40 - 1.70 0.48 1.49 0.55 6.03 0.44 4.66 0.62 3.90 1.41 4.88 1.38 3.49 1.03 5.81 1.82 1.19 0.30 1.34 0.37 𝑡err 𝑟err 𝑡err 𝑟err 10 Ours 1.08 ± 0.07 0.28 ± 0.02 1.29 ± 0.04 0.36 ± 0.02 Seq Err DeepV2d [240] 00 05 𝑡err 𝑟err 𝑡err 𝑟err 3.80 1.66 3.25 1.34 Ours 1.19 ± 0.04 0.39 ± 0.02 1.36 ± 0.05 0.40 ± 0.03 Table 5.3 KITTI Odometry Evaluation. Results in the right of the table are trained on Eigen split [71] and tested on odometry sequence 00 and 05. Performance is reported with 5 random runs. Self-supervised methods are marked with *. † uses test time parameter fine-tuning (PFT) [264]. [Key: Red color marks Best, Blue color marks the Second Best] 5.4 Experiments We evaluate depth on KITTI and NYUv2 where both video and monocular depth methods report their results. We conduct indoor pose comparison on ScanNet as NYUv2 does not have pose groundtruth. Implementation Details For both KITTI and NYUv2 experiments, we train with the Adam optimizer [128] with a learning rate of 1𝑒−4. The training takes 20 epochs with a batch size of 4. We train 2 days on 2 RTX 2080 Ti GPUs. 
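The bound in Eq. (5.16) is a direct application of the triangle inequality once the ground-truth scale is written in the same per-pixel form as Eq. (5.15), i.e., assuming log(˜𝑠) = (1/𝑛) Σ (log( ˜𝑑𝑖) + ˜𝑚𝑖) with ˜𝑚𝑖 the offset of Eq. (5.7) evaluated with ground-truth pose and flow (our notation; the thesis defers the full proof to the supplementary material):

```latex
\begin{align*}
L_{s^*} = \big\lVert \log(\tilde{s}) - \log(s^*) \big\rVert
  &= \Big\lVert \tfrac{1}{n}\sum_{i=1}^{n}\big(\log(\tilde{d}_i) + \tilde{m}_i\big)
        - \tfrac{1}{n}\sum_{i=1}^{n}\big(\log(d_i) + m_i\big) \Big\rVert \\
  &= \Big\lVert \tfrac{1}{n}\sum_{i=1}^{n}\big(\log(\tilde{d}_i) - \log(d_i)\big)
        + \tfrac{1}{n}\sum_{i=1}^{n}\big(\tilde{m}_i - m_i\big) \Big\rVert \\
  &\le \underbrace{\tfrac{1}{n}\sum_{i=1}^{n}\big\lVert \log(\tilde{d}_i) - \log(d_i) \big\rVert}_{\text{video depth objective}}
     + \underbrace{\Big\lVert \tfrac{1}{n}\sum_{i=1}^{n}\big(\tilde{m}_i - m_i\big) \Big\rVert}_{\text{pose / flow noise}} .
\end{align*}
```

The first term is exactly the log-space depth objective, while the second depends only on the normalized pose and optical flow estimates, matching the two terms of Eq. (5.16).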
For the pre-computed initial monocular depthmap, we apply color augmentation to ensure consistent performance between validation and training set. We use BTS [137] during training but test against various mono-depth inputs. For all three monocular methods, BTS [137], AdaBins [18], and NewCRFs [309], we use the author released models. The Monodepth2 [88] is re-trained by us. For flow, we adopt the publicly available model of RAFT [241] trained using the synthetic datasets [166]. On KITTI, we train with a cropped 320 × 576 resolution. On NYUv2, we train with the original resolution. For both datasets, we test with their full resolution. The residual depth candidates D with a size of 𝑘 D = 32. While selecting the random correspondences from flowmap for pose estimation, we do not apply forward- backward consistency [322] as the improvement does not worth its running time. But we exclude the invisible area and object edges in the next views. We use the OpenCV’s EPnP [139] algorithm as a replacement if the five-point algorithms fail. 5.4.1 Monocular Video Depth and Pose Estimation KITTI Depth KITTI is a widely adopted benchmark for outdoor scenes with stereo, LiDAR, and GPS/IMU available. For fair comparison, we train with Eigen split [71], evaluated on semi-dense groundtruth [257] under Garg crop [82] capped at 80 meters. Tab. 5.1 reports results in standard 75 (a) RGB Input I𝑡 (b) DeepV2D [240] (c) Init D∗ (d) Finetuned D† (e) Residual (meter) Figure 5.7 Subplot (e) shows residual depth D∗ · (exp(ΔD) − 1) in meter. In Green boxes, mono- depthmap gets improved after residual estimation. In Pink boxes, artifacts around moving fore- ground objects are avoided. 7 metrics [71], with baselines from both single-view and multi-view methods. We outperform all of them by a substantial margin. Particularly, compared to 2-view methods [94, 268], our method significantly reduces 66.7% and 77.3% errors on the 𝑎1 metric (𝛿 < 1.25). Additionally, we are the first 2-view work to outperform the 5-view SoTA performance [240], achieving a substantial improvement of 60.9% (= 0.991−0.977 1−0.977 ) on 𝑎1 metric. Further, we reduce 70.5% 𝑎1 metric error compared to our mono-depth initialization BTS. Fig. 5.7 shows our improvement qualitatively. Finally, our performance gain over prior SoTA does not attribute to monocular initialization. In Tab. 5.1, our result still substantially outperforms DeepV2D with a lightweight MonoDepth2 monocular initialization. NYUv2 Depth NYUv2 dataset [178] has RGB and depth image pairs in indoor environments. Our experiment follows the standard train/test split [71]. As NYUv2 is captured by a handheld camera, rotation frequently dominates camera motion across frames, which is undesirable for video depth estimation (see Fig. 5.5). Despite all the hurdles, our 2-view performance grouped with NewCRFs [309] still substantially outperforms 8-view DeepV2D, reducing 34.1% error on 𝑎1 metric. Compared to its 2-view performance, the improvement goes up to 46.3%. Further, our method shows great generalization ability under different mono-initialization. Though trained with BTS, when tested with BTS, AdaBins, and NewCRFs, we reduce error on 𝑎1 metric by 15.7%, 20.6%, and 16.7%, respectively. However, this performance gain is less than in KITTI (15.7% to 70.5%), indicating our method shines more on videos with sufficient translation. KITTI Pose KITTI Odometry includes 20 driving videos with 11 having odometry groundtruth. 
Our experiment includes both self-supervised and supervised methods and reports standard met- 76 ScanNet Rotation (degree) ↓ Translation (degree) ↓ Translation (cm) ↓ DeMoN [258] BA-Net [237] DSO DeepV2D-2 DeepV2D-8 FivePoint 3.791 31.626 15.500 1.009 14.626 2.365 0.946 19.238 2.165 0.806 13.259 1.726 0.714 12.205 1.514 0.671 13.878 1.524 Ours 0.621 ± 0.007 12.840 ± 0.161 1.440 ± 0.011 Table 5.4 ScanNet Pose Evaluation. DeMoN, BA-Net, and DSO are trained on ScanNet. DSO is evaluated only on success cases. DeepV2D and ours are trained on NYUv2 and tested on ScanNet. DeepV2D-2/8 are DeepV2D taking 2 or 8 frames. FivePoint is the baseline five-point algorithm with RANSAC. Our result is reported with 5 random runs. [Key: Red color marks Best, Blue color marks the Second Best] I T T I K ResDepth PoesEstimation ScaleNet Abs Rel 0.070 0.038 0.037 ✓ ✓ ✓ ✓ ✓ ✓ Sq Rel RMSE RMSE log 0.275 0.110 0.117 2.405 1.821 1.841 0.093 0.060 0.059 𝛿 < 1.25 0.959 0.987 0.986 Seq-00 𝑡err 1.55 1.55 1.24 Table 5.5 Ablation on Outdoor Video Depth Estimation. depth learning (Sec. 5.3.2). ‘ScaleNet’=Further refine pose scale with an additional ScaleNet (detailed in Supplementary).] ‘ResDepth’= Residual ‘PoseEstimation’= Proposed Pose Estimation Method (Sec. 5.3.1). [Key: rics [90]. For methods [322, 334, 90, 264, 268, 120], we follow [90] to train/test on sequences 00-08/09-10. For DeepV2D [240], as trained on Eigen split [71], we test on unseen sequences 00 and 05. As odometry from self-supervised methods lacks real-world scale priors, we align prediction against groundtruth trajectory by applying 7 DoF transformation [322] during inference. In Tab. 5.3, we outperform SoTA on rotation and translation errors. ScanNet Pose ScanNet [52] is a large indoor dataset with groundtruth depthmap and camera trajectory. We follow DeepV2D’s test protocol, train on NYUv2, and test on 2, 000 sequences of ScanNet. We outperform 8 frames DeepV2D-8 except for the metric ‘tr. (deg)’. Further, our method achieves solid improvement over 2-view DeepV2D. 5.4.2 Ablation Study The Equality between Scale and Video Depth Learning In Tab. 5.5 row 2 & 3, we ablate pose & depth performance if augment pose scale learning with an additional ScaleNet (detailed in Supplementary). Clearly, the added ScaleNet learns additional scale prior, reducing 𝑡err from 1.55 to 1.24. However, the improved pose scale does not benefit video depth due to the equality between their learning objective. Further, this benefit diminishes after updating the scale with video depthmap (1.19 from Tab. 5.3 and 1.24 from Tab. 5.5). This is expected, as the LiDAR depth possesses less noise than IMU and GPS pose. Thus we empirically demonstrate the equality 77 2 v U Y N ✓ FivePoint PoesEstimation KeySearch Abs Rel 0.063 0.061 0.057 ✓ ✓ ✓ Sc Inv RMSE 0.248 0.087 0.239 0.083 0.230 0.080 log10 0.027 0.026 0.025 𝛿 < 1.25 0.964 0.968 0.971 Table 5.6 Ablation on Indoor Video Depth Estimation. ‘FivePoint’=Baseline Five- point algorithm with RANSAC. ‘PoseEstimation’=Proposed Pose Estimation Method (Sec. 5.3.1). ‘KeySearch’=Keyframe search. Bold marks the best score.] [Key: between scale and video depth learning. Residual Depth Estimation Estimating video depth as logged residual improves cost volume sam- pling efficiency, supported by our improvement over SfMR [268] in Tab. 5.1 and the performance gap in row 1 and 2 of Tab. 5.5. Meanwhile, it avoids artifacts in moving objects, as in Fig. 5.7. 
Pose Estimation and Key-frame Search Compared to using baseline five-point algorithm over flow estimate [268, 322], our proposed method benefits both pose and depth performance, as shown in Tabs. 5.4 and 5.6. Also, ensuring sufficient camera translation shows noticeable improvement, as shown in Tab. 5.6. Computational Efficiency We compare the running time to DeepV2D [240] on an RTX 2080 Ti GPU, for 192 × 1088 images. In Fig. 6.3, our inference has 1 + 2 steps: initialization of flow [241] and mono-depth [137], pose estimation, and video depth estimation. Each takes 0.124 + 0.063, 0.253, 0.058s respectively, in total 0.498s. In comparison, 5-view DeepV2D takes 1.619s. 5.5 Conclusions Video depth estimation in prior works is solved as a simplified SfM problem. But video depth has fewer view angles in video depth estimation. Thus, we decompose it into two sub-tasks that are robust to deficient views, i.e., normalized pose, and residual depth estimation. We connect the two tasks with a scale alignment algorithm. The proposed framework improves both pose and video depth. Limitations Our method depends on multiple modality initializations. A joint model is preferred. 78 CHAPTER 6 RSFM: REVISIT SELF-SUPERVISED DEPTH ESTIMATION WITH LOCAL STRUCTURE-FROM-MOTION Both self-supervised depth estimation and Structure-from-Motion (SfM) recover scene depth from RGB videos. Despite sharing a similar objective, the two approaches are disconnected. Prior works of self-supervision backpropagate losses defined within immediate neighboring frames. Instead of learning-through-loss, this work proposes an alternative scheme by performing local SfM. First, with calibrated RGB or RGB-D images, we employ a depth and correspondence estimator to infer depthmaps and pair-wise correspondence maps. Then, a novel bundle-RANSAC-adjustment algorithm jointly optimizes camera poses and one depth adjustment for each depthmap. Finally, we fix camera poses and employ a NeRF, however, without a neural network, for dense triangulation and geometric verification. Poses, depth adjustments, and triangulated sparse depths are our outputs. For the first time, we show self-supervision within 5 frames already benefits SoTA supervised depth and correspondence models. Despite self-supervision, our pose algorithm has certified global optimality, outperforming optimization-based, learning-based, and NeRF-based prior arts. 6.1 Introduction Monocular depth estimation [79, 137] infers depthmap from a single image. It is an essential vision task with applications in AR/VR [182], autonomous driving [84], and 3D reconstruction [31]. Most methods [309, 19, 195, 189] supervise the model with groundtruth collected from stereo cameras [319] or LiDAR [84]. Recently, self-supervised depth [90, 88, 327] has drawn significant attention due to its potential to scale up depth learning from massive unlabeled RGB videos. Classic SfM methods [209, 228, 205, 286, 1, 50] also reconstruct scene depth from unlabled RGB videos. Despite its relevance, SfM is rarely applied to self-supervised depth learning. We outline two potential reasons. First, SfM is an off-the-shelf algorithm unrelated to the depth estimator. Scale ambiguity renders SfM poses and depths at different scales compared to depth models. Second, self-supervision has a well-defined training scheme to work with universal unlabeled videos. It backpropagates through photometric loss computed within immediate neighboring frames, e.g., 79 Figure 6.1 Revisit Self-supervision with Local SfM. 
The work proposes alternating the learning- through-loss with a local SfM pipeline for self-supervised depth estimation. We summarize our differences. On self-supervision: (1) Instead of using naive two-view camera poses, we propose a Bundle-RANSAC-Adjustment pose optimization algorithm with multi-view constraints. (2) Instead of backpropagating through a loss, we produce a sparse point cloud with explicit triangulation and geometric verification. The point cloud serves as either output or pseudo-groundtruth for self- supervision. On SfM: (1) Our local SfM is adapted to use estimated monocular depthmaps and automatically resolve their scale inconsistency between pairs of images. (2) We maintain accuracy under significant sparse view variations, e.g., red trajectories. We generalize SfM to as few as 5 frames, similar to the number of images used to define self-supervision loss. red trajectory in Fig. 7.1. In contrast, SfM is more selective to input videos. It requires images of diverse view variations (green trajectory in Fig. 7.1), being inaccurate and unstable when applied to a small frame window. This work connects self-supervision with SfM. We replace the self-supervision loss with a complete SfM pipeline that maintains robustness to a local window. Shown in Fig. 6.2, with 𝑁 frames as input, our algorithm outputs 𝑁 − 1 camera poses, 𝑁 − 1 depth adjustments, and the sparse triangulated point cloud. In initialization, 𝑁 monocular depthmaps and 𝑁 × (𝑁 − 1) pairwise correspondence maps are inferred. Next, we propose a Bundle-RANSAC-Adjustment pose estimation algorithm that retains accuracy for second-long videos. The algorithm utilizes the 3D priors from monocular depthmap to compensate for the deficient camera views. Correspondingly, we optimize 𝑁 − 1 depth adjustments to alleviate the depth scale ambiguity by temporally aligning to the root frame depth. The Bundle-RANSAC-Adjustment extends two-view RANSAC with multi-view bundle-adjustment 80 Figure 6.2 Local Structure-from-Motion. With 𝑁 neighboring frames, we extract monocular depthmaps and pairwise dense correspondence maps with methods, e.g., ZoeDepth [19] and PDC- Net [251]. Next, skipping the root frame, we optimize the rest 𝑁 − 1 camera poses and depth adjustments. The depth adjustments render input depthmaps temporally consistent. Fixing poses and adjustments, we use the Radiance Field (RF) for triangulation. A geometrically verified sparse root depthmap is output. Our local SfM applies self-supervision with only 5 RGB frames. Yet, our sparse output already outperforms the input supervised depth with SoTA performance. (BA). The algorithm has quadratic complexity and is designed for parallel GPU computation. We RANdomly SAmple and hypothesize a set of normalized poses. In Consensus checking, we apply BA to evaluate a robust inlier-counting scoring function over multi-view images. Camera scales and depth adjustments are determined during BA to maximize the scoring function. Next, we freeze the optimized poses and employ a Radiance Field (RF), i.e., a NeRFF [170] without a neural network, for triangulation. We optimize RF to achieve multi-view depthmap and correspondence consistency within a shared 3D frustum volume. For outputs, we apply geometric verification to extract multi-view consistent point cloud, i.e., a sparse root depthmap. Fig. 7.1 contrasts our method with prior self-supervised depth and SfM methods. 
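To make the robust inlier-counting score concrete, the following sketch (hypothetical names and threshold; a serial single-support-frame version of what the method evaluates in parallel on GPU) counts how many sampled root-frame pixels, back-projected with a candidate scale and adjusted root depths and transformed by a hypothesized normalized pose, reproject within a pixel tolerance of their dense-correspondence targets. Sec. 6.3.1 formalizes the joint maximization of this kind of count over camera scales and depth adjustments.

```python
import numpy as np

def inlier_score(pose, scale, pix_root, depth_root, corr_target, K, thresh=2.0):
    """Hypothetical inlier-counting score for a single support frame.

    pose:        (3, 4) normalized pose [R | t] with unit-length translation.
    scale:       scalar translation scale hypothesized during bundle-adjustment.
    pix_root:    (M, 2) sampled root-frame pixel coordinates.
    depth_root:  (M,)   adjusted root-frame depths at those pixels.
    corr_target: (M, 2) locations the dense correspondence map assigns to those pixels
                 in the support frame.
    K:           (3, 3) shared intrinsics; `thresh` is a reprojection tolerance in pixels.
    """
    R, t = pose[:, :3], pose[:, 3]
    pix_h = np.concatenate([pix_root, np.ones((len(pix_root), 1))], axis=1)  # homogeneous pixels
    pts_root = depth_root[:, None] * (np.linalg.inv(K) @ pix_h.T).T          # back-project root pixels
    pts_i = pts_root @ R.T + scale * t                                       # move into the support frame
    proj = pts_i @ K.T
    proj = proj[:, :2] / proj[:, 2:3]                                        # perspective division
    err = np.linalg.norm(proj - corr_target, axis=1)
    return int(np.sum(err < thresh))                                         # robust inlier count
```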
To the best of our knowledge, no prior work has shown that geometry-based self-supervised depth benefits supervised models, even though self-supervision is supposed to augment supervised models with unlabeled data. In Fig. 6.2, our unique pipeline gives the first clear evidence that self-supervision with as few as 5 frames already benefits supervised models. Beyond depth, our multi-view RANSAC pose has certified global optimality under a robust scoring function. It outperforms prior optimization-based [209, 330], learning-based [240, 273], and NeRF-based [253] pose algorithms.

Beyond pose and depth, our method has diverse applications. The depth adjustments from our method provide empirically consistent depthmaps, which is important for AR image compositing. With RGB-D inputs, our method enables self-supervised correspondence estimation: our accurate pose estimation yields better projective correspondence than the SoTA supervised correspondence input. An example is in Fig. 6.9.

We summarize our contributions as:
• We propose a novel local SfM algorithm with Bundle-RANSAC-Adjustment.
• We show the first clear evidence that self-supervised depth with as few as 5 frames already benefits SoTA supervised models.
• We achieve SoTA sparse-view pose estimation performance.
• We enable self-supervised temporally consistent depthmaps.
• We enable self-supervised correspondence estimation with 5 RGB-D frames.

6.2 Related Works
Structure-from-Motion. SfM is a comprehensive task [209, 286]. A typical pipeline consists of correspondence extraction [160, 254, 30], two-view initialization [17, 143], triangulation [142, 183], and local & global bundle-adjustment [209, 286]. Classic methods require diverse view variations for accurate reconstruction. Our method compensates SfM for scarce camera views by introducing a deep depth estimator. Further, we suggest that SfM itself is a self-supervised learning pipeline, as in Fig. 7.1. Finally, our SfM is not up-to-scale; it shares the metric space of the input depthmaps.

Sparse Multi-view Pose Estimation. Estimating poses from sparse frames is crucial for self-supervision [88, 325, 197, 312, 47], video depth estimation [330, 240, 93, 258], and sparse-view NeRF [253, 60, 119, 148, 180]. Camera poses are estimated either by learning [88, 47, 240, 93], by optimization [330, 322], or jointly with NeRF [253, 148]. We propose an additional multi-view RANSAC pipeline with improved accuracy.

Self-supervised Depth and Correspondence Estimation. Multiple works improve self-supervised depth in different ways, including the learning loss [88, 278, 191], architecture [95, 326], camera pose [165, 322, 45], joint training with semantic segmentation [327], and large-scale data [229, 295]. Recently, [229] shows that self-supervision only performs on par with supervised models when given substantially more data. [295] shows the benefit of self-supervision via exploiting non-geometric monocular semantic consistency. Our method shows the first clear evidence that self-supervision benefits supervised models with only 5 consecutive frames.

Consistent Depth Estimation. AR applications necessitate temporally consistent depthmaps, i.e., depthmaps from different temporal frames reside in the same 3D space. Recent works [321, 162] align depthmaps according to the poses and points from the off-the-shelf COLMAP algorithm. Our method seamlessly integrates SfM with monocular depthmaps, outputting consistent depth and poses.

Test Time Refinement (TTR).
TTR aims to improve self-supervised / supervised depth estimators in testing time with RGB video [45, 35, 279, 221, 134]. Methods [116, 245] rely on off-the-shelf algorithms for pseudo depth and pose labels. Recently, [116] first shows TTR improves supervised models. TTR is our downstream application, which details strategies for utilizing noisy pseudo- labels. 6.3 Methodology Our method runs sequentially. From 𝑁 calibrated images I, we extract 𝑁 monocular depthmaps D and 𝑁 × (𝑁 − 1) pair-wise dense correspondence C. We split the 𝑁 images into one root frame I𝑜 in the center of the 𝑁-frame window where 𝑜 = ⌊ 𝑁+1 2 ⌋, and 𝑁 − 1 support frames I𝑖, where 𝑖 ∈ N+ = [1, 𝑁]\{𝑜}. In Sec. 6.3.1, after setting the root frame as identity pose, we use Bundle- RANSAC-Adjustment to optimize 𝑁 − 1 poses P and 𝑁 − 1 depth adjustments R. Next, in Sec. 6.3.2, we apply triangulation by optimizing a frustum Radiance Field (RF) V, i.e., a NeRF without network. Finally, in Sec. 6.3.3, we apply geometric verification by rendering multi-view consistent 3D points from RF. An overview is in Fig. 6.3. 6.3.1 Bundle-RANSAC-Adjustment Pose Estimation We generalize two-view RANSAC with multi-view constraints through Bundle-Adjustment. Sec. 6.3.1.1 describes our pipeline. In Sec. 6.3.1.2, we propose Hough transform to accelerate 83 Figure 6.3 Algorithm Overview. After extracting monodepths and correspondence maps from inputs: (a) We apply Bundle-RANSAC-Adjustment to optimize 𝑁 − 1 camera poses P and 𝑁 − 1 depth adjustments R. (b) We fix poses and depth adjustments and optimize a frustum Radiance Field (RF) for triangulation. (c) We apply geometric verification to extract multi-view consistent 3D points via rendering with RF. We further detail step (a) in Fig. 6.4, 6.5, and 6.6, and steps (b) and (c) in Fig. 6.7. computation. We discuss the time complexity in Sec. 6.3.1.3. 6.3.1.1 Optimization Pipeline RANdom SAmple. We use five-point algorithm [143] as the minimal solver. We execute it between root and each support frame, extracting a pool of (𝑁 − 1) × 𝐾 normalized poses (i.e., 𝑘 pose of unit translation), Q = {P 𝑖 | 𝑖 ∈ N+, 𝑘 ∈ [1, 𝐾]}, where P 𝑘 𝑖 ∈ R3×4. The 𝐾 is the number of normalized poses extracted per frame. We term a set of 𝑁 − 1 normalized poses as a group P ∈ R(𝑁−1)×3×4. Two-view RANSAC enumerates over single normalized pose P. Our multi-view algorithm hence enumerates over normalized pose group P. We initialize the optimal group P ∗ as the top candidate from 𝐾 poses of Q for each frame. See examples in Fig. 6.4. Bundle-Adjustment Consensus. While computing consensus counts, the camera scales S and depth adjustments R are automatically determined with bundle-adjustment to maximize a robust scoring function: 𝜌𝑖 = 𝜙(P) = max S,R 𝑓 (S, R | P, D, C). (6.1) Search for Optimal Group. Our multi-view RANSAC has a significantly larger solution space than two-view RANSAC. With 𝑁 view inputs, we determine the optimal group out of 𝐾 𝑁−1 combinations. Hence, we iteratively search for the optimal group with a greedy strategy. For each 84 Figure 6.4 Pose Optimization Pipeline. We show a sample execution when 𝑁 = 3 and 𝐾 = 3. We is set to top candidates within Q. initialize normalized pose candidates pool Q. Optimal group P 𝑘 In each epoch, Eq. (6.2) ablates pose group P 𝑖 . Each group is scored with Eq. (6.1) via BA with Hough Transform, detailed in Sec. 6.3.1.2. The optimal group with the highest score is updated with Eq. (6.3). Termination occurs when the maximum score stabilizes. 
We maintain quadratic complexity by avoiding repetitive computation after the first epoch, shown with the Comp. Graph, detailed in Sec. 6.3.1.3. ∗ epoch, we ablate (𝑁 − 1)(𝐾 − 1) additional pose groups: 𝑘 𝑖 = P P ∗ 𝑖 \ {P ∗ 𝑖 } ∪ {P 𝑘 𝑖 }, (6.2) where 𝑖 ∈ N+ and 𝑘 ∈ [1, 𝐾]. Combine Eq. (6.2) and Fig. 6.4, taking frame 𝑖 as an example, ∗ 𝑖 by its 𝐾 − 1 other candidates P we replace the optimal pose P 𝑘 𝑖 , generating 𝐾 − 1 groups. For 𝑁 frames, we have (𝑁 − 1)(𝐾 − 1) + 1 groups. We apply bundle-adjustment to each group to evaluate Eq. (6.1). As shown in Fig. 6.3 and Fig. 6.4, we select the normalized pose together with its optimized scales and depth adjustments that maximize the scores as the output, P∗ 𝑖 = 𝑏(P ∗ 𝑖 , S∗ 𝑖 ), R∗ 𝑖 = R 𝑘 𝑖 , where 𝑘 = arg max{𝜌𝑘 𝑖 }, P ∗ 𝑖 = P 𝑘 𝑖 , S∗ 𝑖 = S 𝑘 𝑖 , (6.3) where 𝑏(·) combines normalized poses with scales. Fig. 6.2 third column plots an adjusted temporal consistent depthmap after applying R∗. In Fig. 6.4, the algorithm terminates when the maximum score stops increasing. Scoring Function. Similar to other RANSAC methods, we adopt robust inlier-counting based 85 scoring functions. Expand Eq. (6.1) for a specific group P: 𝜙(P) = ∑︁ ∑︁ 𝑖,𝑖≠ 𝑗 𝑗 𝑓𝑖, 𝑗 (𝑠𝑖, 𝑠 𝑗 , 𝑟𝑖, 𝑟 𝑗 | P𝑖, P 𝑗 , D𝑖, D 𝑗 , C𝑖, 𝑗 ), (6.4) where 𝑖, 𝑗 are frame index. We set per-frame camera scale, depth, depth adjustment, and corre- spondence as 𝑠 ∈ S, D ∈ D, 𝑟 ∈ R, and C ∈ C. The scoring function 𝑓𝑖, 𝑗 (·) has various forms. First, we describe a 2D scoring function: 𝑓 2D 𝑖, 𝑗 (·) = ∑︁ (cid:16) 1 ∥𝜋(𝑠𝑖, 𝑠 𝑗 , 𝑟𝑖 | P𝑖, P 𝑗 , 𝑑𝑚 𝑖 ) − c𝑚 𝑖, 𝑗 ∥2 < 𝜆2D(cid:17) , (6.5) 𝑚 where 𝑚 ∈ [1, 𝑀] indexes sampled pixels per frame pair. 𝑓 2D 𝑖, 𝑗 (·) measures the inlier count between depth projected correspondence and input correspondence. 𝜋(·) is projection process. Intrinsic is skipped. 𝑑 and c are depth and correspondence sampled from D and C. An example is in Fig. 6.5. The 1(·) is the indicator function. The projected pixel is an inlier if it resides within the circle of radius 𝜆2D and center at correspondence c𝑚 𝑖, 𝑗 (denoted as p 𝑗 in Fig. 6.5). c𝑚 𝑖, 𝑗 is sampled from correspondence map C𝑖, 𝑗 . Second, we introduce a 3D scoring function: 𝑓 3D 𝑖, 𝑗 (·) = ∑︁ (cid:16) 1 𝑚 ∥𝜋-1(𝑠𝑖 | P𝑖, 𝑟𝑖, 𝑑𝑚 𝑖 ) − 𝜋-1(𝑠 𝑗 | P 𝑗 , 𝑟 𝑗 , 𝑑𝑚 𝑗 ) ∥2 < 𝜆3D(cid:17) . (6.6) Depth pair 𝑑𝑖 and 𝑑 𝑗 is determined by correspondence. Unlike the 2D one, the 3D function fixes depth adjustment 𝑟. Function 𝜋−1(·) back-projects 3D point. 6.3.1.2 Hough Transform Acceleration Maximizing Eq. (6.1) for each pose group is computationally prohibitive, as shown in Fig. 6.4. We propose Hough Transform for acceleration. We use Eq. (6.5), the 2D function 𝑓 2D(·) as an example for illustration. See our motivation in Fig. 6.5. Hough Transform. The relative pose between P𝑖 and P 𝑗 is defined as: P𝑖, 𝑗 = P 𝑗 P-1 𝑖 = (cid:104) R𝑖, 𝑗 𝑠𝑖, 𝑗 t𝑖, 𝑗 (cid:105) (cid:104) = R 𝑗 R-1 𝑖 −𝑠𝑖R 𝑗 R-1 𝑖 t𝑖 + 𝑠 𝑗 t 𝑗 (cid:105) , (6.7) where R, t, and 𝑠 are rotation, normalized translation and pose scale. From Eq. (6.7) and Fig. 6.5, t𝑖, 𝑗 is controlled by the scale 𝑠𝑖 and 𝑠 𝑗 , and thus we have: lim 𝑠𝑖→+ inf t𝑖, 𝑗 = −R 𝑗 R-1 𝑖 t𝑖, lim 𝑠 𝑗 →+ inf t𝑖, 𝑗 = t 𝑗 . (6.8) 86 Figure 6.5 Hough Transform between Two Normalized Poses. With fixed normalized poses, there exists three variables, scales 𝑠𝑖 & 𝑠 𝑗 of P𝑖 & P 𝑗 and adjustment 𝑟𝑖. Pixel p𝑖 and p 𝑗 are corresponded. Ablating pose scales maps pixel p𝑖 to a set of epipolar lines {l𝑖}, however, bounded by Red and Green at infinite scales. We have three observations. 
First, with fixed normalized poses, epipolar lines l𝑖 have limited possibilities. Second, scale 𝑠 and depth adjustment 𝑑 are equivalent, both adjusting projection on epipolar line. Third, per epipolar line, to be an inlier, the projection has to reside within the line-circle intersection, between pst 𝜋 . The observations motivate us to discretize the solution space to a 2D matrix, i.e., Hough Transform. Right figure plots an example transformation H𝑚 𝑖, 𝑗 from frame 𝑖 to 𝑗 on the 𝑚th pixel p𝑖. 𝜋 and ped For a pixel p𝑖 on frame 𝑖, its corresponding epipolar line l𝑖 on frame 𝑗 is: l𝑖 = K-⊺ [ t𝑖, 𝑗 ]×R𝑖, 𝑗 K-1p𝑖. (6.9) Eq. (6.8) and Eq. (6.9) suggest the epipolar line has limited possibilities. Operation [·]× is the cross product in matrix form. Further, as the depth re-projected pixel p𝜋 of p𝑖 always locate on the epipolar line l𝑖 [99], we have: ⊺ 𝑖 p𝜋 = 0, p𝜋 = 𝜋(𝑠𝑖, 𝑠 𝑗 , 𝑟𝑖 | P𝑖, P 𝑗 , 𝑑𝑖). l To be an inlier of the scoring function 𝑓 2D(·), we have: ∥p𝜋 − p 𝑗 ∥2 ≤ 𝜆2D. (6.10) (6.11) Combining Eq. (6.10), Eq. (6.11) and Fig. 6.5, to be an inlier, the projected pixel p𝜋 has to reside within the line segment, with two end-points computed by the line-circle intersection. The circle centers at corresponded pixel p 𝑗 on frame 𝑗 with a radius 𝜆2D. We denote the two end-points pst 𝜋 and ped 𝜋 . Function 𝐽 (·) follows [330] Supp. Eq. (4), which maps a projected pixel p𝜋 and adjusted depth 𝑟𝑖𝑑𝑖 to camera scale 𝑠𝑖, 𝑗 as: 𝑠𝑖, 𝑗 = 𝐽 (P𝑖, 𝑗 , 𝑟𝑖𝑑𝑖, p𝜋). 87 𝑗 Figure 6.6 Visualize Hough Transform Matrix H 𝑖 from Eq. (6.18). Area with higher intensity suggests more inlier counts. Given normalized pose group, for 𝑁 views, there exists 𝑁 × (𝑁 − 1) 𝑗 𝑖 , constraining 𝑁 − 1 scale and 𝑁 − 1 adjustments. We plot the start and end points after matrices H optimizing Eq. (6.19) in the figure. Corollary 1 A pixel is an inlier iff: 𝐽 (P𝑖, 𝑗 , 𝑟𝑖𝑑𝑖, pst 𝜋 ) ≤ 𝑠𝑖, 𝑗 ≤ 𝐽 (P𝑖, 𝑗 , 𝑟𝑖𝑑𝑖, ped 𝜋 ). (6.12) Corollary 2 Scale and depth are equivalent as; 𝑠𝑖, 𝑗 = 𝐽 (P𝑖, 𝑗 , 𝑟𝑖𝑑𝑖, p𝜋) = 𝑟𝑖 · 𝐽 (P𝑖, 𝑗 , 𝑑𝑖, p𝜋). (6.13) Combine Eqs. (6.12) and (6.13), 𝐽 (P𝑖, 𝑗 , 𝑑𝑖, pst 𝜋 ) ≤ 𝑠𝑖, 𝑗 𝑟𝑖 ≤ 𝐽 (P𝑖, 𝑗 , 𝑑𝑖, ped 𝜋 ). (6.14) Set 𝑔(·) maps the variables under optimization to intermediate term 𝑠𝑖, 𝑗 𝑟𝑖 : 𝐽 (P𝑖, 𝑗 , 𝑑𝑖, pst 𝜋 ) ≤ 𝑔(𝑟𝑖, 𝑠𝑖, 𝑠 𝑗 | P𝑖, P 𝑗 ) ≤ 𝐽 (P𝑖, 𝑗 , 𝑑𝑖, ped 𝜋 ). (6.15) The 𝑖th pixel is an inlier if and only if its projection satisfies Eq. (6.15). Note, the value space of function 𝑔(·) is mapped to a 2D space H after Hough Transform: 𝑥 = 𝑔(𝑟𝑖, 𝑠𝑖, 𝑠 𝑗 ⊺ 𝑖, 𝑗 t 𝑗 ), | P𝑖, P 𝑗 ), 𝑦 = arccos(t (6.16) where 𝑥 and 𝑦 are transformed coordinates. From Eq. (6.16), 𝑥 is a synthesized translation magnitude and 𝑦 is angular variable. We then set 𝑥 ∈ [0, 𝑥max], and 𝑦 ∈ [0, 𝜃max], where ⊺ 𝜃max = arccos(−t 𝑗 R𝑖, 𝑗 t𝑖). Finally, the value of H is: ∀𝑦 ∈ [0, 𝜃max], H(𝑥 | 𝑦) = 1, if 𝑥 ∈ [𝐽min, 𝐽max], (6.17) 88 where 𝐽min and 𝐽max are the two bounds from Eq. (6.15). The transformation over the scoring function 𝑓 2D 𝑖, 𝑗 with all 𝑀 sampled pixels between frame I𝑖 and I 𝑗 : H𝑖, 𝑗 = H𝑚 𝑖, 𝑗 , ∑︁ 𝑚 𝑖, 𝑗 (𝑠𝑖, 𝑠 𝑗 , 𝑟𝑖 | P𝑖, P 𝑗 ) = H𝑖, 𝑗 (𝑥, 𝑦), 𝑓 2D (6.18) where 𝑥 and 𝑦 are functions of 𝑠𝑖, 𝑠 𝑗 , 𝑟𝑖. Eq. (6.1) becomes: 𝜙(P) = max S,R ∑︁ ∑︁ 𝑖 𝑗, 𝑗≠𝑖 H𝑖, 𝑗 (𝑥(S, R), 𝑦(S, R)). (6.19) In our implementation, we discretize H𝑖, 𝑗 to a 2D matrix. Accelerate Bundle-Adjustment Consensus. The BA determines 𝑁 − 1 camera scales and 𝑁 − 1 depth adjustments to maximize the scoring function 𝜙(·) in Eq. (6.19). With Hough transform, BA maximizes the summarized intensity via indexing 𝑁 × (𝑁 − 1) Hough transform matrices H. 
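To illustrate how the discretized Hough matrices are accumulated and then indexed during BA, a minimal NumPy sketch is given below. It assumes the per-pixel inlier bounds J_min, J_max of Eq. (6.15) have already been evaluated for each angular bin; the bin resolutions and function names are illustrative choices, not the thesis implementation.

```python
import numpy as np

def build_hough_matrix(j_min, j_max, x_max, n_x=128, n_y=64):
    """Accumulate per-pixel inlier intervals into one Hough matrix H (Eq. (6.18)).

    j_min, j_max: (M, n_y) bounds from the line-circle intersection of Eq. (6.15);
    pixel m is an inlier at angular bin y iff the transformed coordinate x lies
    in [j_min[m, y], j_max[m, y]].
    """
    xs = np.linspace(0.0, x_max, n_x)
    inside = (xs[None, None, :] >= j_min[..., None]) & \
             (xs[None, None, :] <= j_max[..., None])      # (M, n_y, n_x) membership
    return inside.sum(axis=0).T.astype(np.int32)          # H with shape (n_x, n_y)

def lookup_score(H, x, y, x_max, theta_max):
    """Evaluate f_2D for a candidate scale/adjustment by indexing H (Eq. (6.19))."""
    n_x, n_y = H.shape
    xi = np.clip(int(x / x_max * (n_x - 1)), 0, n_x - 1)
    yi = np.clip(int(y / theta_max * (n_y - 1)), 0, n_y - 1)
    return int(H[xi, yi])
```

Once H is built, each evaluation of the scoring function during BA reduces to a single table lookup, which is what keeps T ≪ M in the complexity analysis of Sec. 6.3.1.3.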
It avoids BA repetitively enumerating all sampled pixels. Fig. 6.6 shows an example optimization process. Certified Global Optimality of robust inlier-counts scoring function Eq. (6.5) and Eq. (6.6) are achieved after optimization. See Fig. 6.8 for more analysis. Optimization with RGB-D. With GT depthmap, the algorithm switches to the 3D scoring function 𝑓 3D 𝑖, 𝑗 (·). The depth adjustment is fixed to 1 and the 2D line-circle intersection becomes 3D line- sphere intersection. 6.3.1.3 Computational Complexity Naive Time Complexity. From Eq. (6.2) and Fig. 6.4, in each epoch, we evaluate (𝑁 − 1) (𝐾 − 1) pose groups with Hough Transform Acceleration. Suppose each group takes 𝑇 iterations to optimize Eq. (6.19), the time complexity is: O ((𝑁 − 1)(𝐾 − 1) · 𝑁 (𝑁 − 1) · (𝑀 + 𝑇)), (6.20) where each group computes 𝑁 (𝑁 − 1) Hough matrices H. Each matrix enumerates 𝑀 sampled pixels, see Eq. (6.18). Maximizing Eq. (6.19) becomes indexing H, hence has constant time complexity 𝑇, where 𝑇 << 𝑀. 89 (a) Triangulation (b) Geometric Verification Figure 6.7 Triangulation optimizes frustum RF for multiview consistency w.r.t. depth and corre- spondence. Geometric Verification inferences RF for sparse multiview consistent 3D points. For simplicity, in (a), we only plot 𝐿𝑐 defined from the root frame. Figure 6.8 Ablation Studies on the ScanNet. Counting Unique Hough Matrices. Most computation is spent on Hough matrices. In Fig. 6.4, each connection in the computation graph suggests two unique Hough matrices. We minimize time complexity by only computing unique Hough matrices. In Fig. 6.4 first epoch, the initial optimal ∗ group P has 𝑁 (𝑁 − 1) matrices. Each ablated group only differs by one pose, hence introducing 2(𝑁 − 1)(𝑁 − 1)(𝐾 − 1) matrices. The first-epoch complexity is then: O𝐻 (𝑁 (𝑁 − 1) 𝑀 + 2(𝑁 − 1)2(𝐾 − 1)𝑀) + OBA(𝑁 (𝑁 − 1) (𝐾 − 1)𝑇). (6.21) Only the Hough transform is accelerated. As 𝑇 << 𝑀, the complexity of BA is neglectable. After the first epoch, P ∗ only updates one pose per epoch, hence introducing 2(𝑁 − 2) (𝐾 − 1) matrices. The complexity for the rest epochs is, O𝐻 (2(𝑁 − 2)(𝐾 − 1) 𝑀) + OBA(𝑁 (𝑁 − 1) (𝐾 − 1)𝑇). (6.22) 90 Dataset ScanNet [52] KITTI360 [147] Method ZoeDepth [19] ⌜ Ours ZeroDepth [153] ⌜ Ours Metric3D [303] ⌜ Ours ZoeDepth [19] ⌜ Ours ZeroDepth [153] ⌜ Ours Metric3D [303] ⌜ Ours Density 9.1% 5.6% 2.6% 4.0% 4.5% 3.2% 𝛿0.5 0.877 0.902 0.641 0.686 0.804 0.854 0.677 0.719 0.584 0.654 0.846 0.860 𝛿1 0.963 0.976 0.834 0.877 0.946 0.968 0.899 0.910 0.844 0.877 0.958 0.963 SIlog 6.655 5.901 12.860 9.463 6.708 4.170 14.154 13.220 16.468 13.881 9.226 8.896 A.Rel 0.056 0.050 0.124 0.106 0.067 0.055 0.103 0.094 0.132 0.115 0.072 0.068 S.Rel RMS RMSlog 0.075 0.154 0.016 0.070 0.149 0.014 0.152 0.337 0.086 0.133 0.295 0.067 0.084 0.150 0.020 0.068 0.125 0.014 0.153 3.521 0.490 0.145 3.499 0.474 0.183 3.486 0.819 0.164 3.395 0.772 0.104 2.194 0.508 0.101 2.139 0.487 Table 6.1 Self-Supervised Depth Estimation. We apply self-supervision with 5 frames via executing the local SfM. We output improved sparse depthmaps over SoTA supervised inputs. The evaluation is conducted over the root frame. While Eq. (6.22) has linear complexity, our method only updates one pose per epoch. Updating poses in all frames like other SfM methods is still quadratic. 6.3.2 Frustum Radiance Field Triangulation Frustum Radiance Field. Now, we fix the optimized pose P∗. Then we employ a frustum radiance field V of size 𝐻 × 𝑊 × 𝐷 for dense triangulation. 
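Concretely, the frustum field is nothing more than a dense occupancy grid, and depth is rendered from it by standard transmittance weighting (cf. Eq. (6.23), given just below). A minimal NumPy sketch, with arbitrary grid sizes and placeholder names rather than our actual settings, is:

```python
import numpy as np

H, W, D = 60, 80, 64
V = np.zeros((H, W, D), dtype=np.float32)   # frustum occupancy grid; no MLP involved

def render_depth(sigma, depth_bins):
    """Transmittance-weighted depth along one ray, following Eq. (6.23).

    sigma: occupancies sampled from V along the ray (length T).
    depth_bins: depth labels d_t of the samples (length T).
    """
    delta = np.diff(depth_bins)
    delta = np.append(delta, delta[-1])             # integration intervals delta_t
    alpha = 1.0 - np.exp(-sigma * delta)
    trans = np.exp(-np.cumsum(sigma * delta))       # T_t as written in Eq. (6.23)
    return float(np.sum(trans * alpha * depth_bins))
```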
Field V is defined over the root frame I𝑜 and shares similarity with the categorical depthmap [79, 18]. We follow [277, 253] in rendering the depth 𝑑. The RGB estimation is skipped as unrelated. A 3D ray originated from pixel p𝑖 at frame 𝑖 is discretized into a set of 3D points and depth labels. With slight abuse of notation, we denote { ˆp𝑖,𝑡 = o + 𝑑𝑡r | 𝑡 ∈ [1, 𝑇]}, where ˆp is a 3D point, 𝑑𝑡 is depth label and r is ray direction. Set integration interval 𝛿𝑡 = 𝑑𝑡+1 − 𝑑𝑡, depth 𝑑 is: 𝑑 (p𝑖) = ∑︁ 𝑡 𝛼𝑡 𝑑𝑡, 𝛼𝑡 = 𝑇𝑡 (1 − exp (−𝜎𝑡𝛿𝑡)), 𝑇𝑡 = exp(− ∑︁ 𝑡′∈[1,𝑡] 𝜎𝑡′ 𝛿𝑡′). (6.23) We set the camera origin of frame 𝑖 as o. Instead of regressing occupancy 𝛿 with MLP [277, 253], we directly interpolate the radiance field V: 𝛿𝑡 = V(𝑢, 𝑣, 𝑤), where (cid:104) 𝑢 𝑣 𝑤 (cid:105) ⊺ = 𝜋(E, ˆp𝑖,𝑡). (6.24) Matrix E is the identity matrix. Function 𝜋(·) is projection function. Compared to using the MLP, frustum radiance field V is more computationally efficient [78]. 91 Triangulation. Classic triangulation method [209] operates on a single 3D point. The RF provides additional constraints where all optimized points share a canonical 3D volume. In Fig. 6.7, we supervise V for multi-view consistency between dense depthmap D and correspondence map C. On depth: 𝐿 𝐷 = 1 𝑁 𝑀 ∑︁ ∑︁ 𝑖 𝑚 ∥𝜋(P𝑖, ˆp𝑚) − 𝑑𝑚 𝑖 ∥1. (6.25) Here, ˆp𝑚 is rendered from the root frame, following depth computed with Eq. (6.23). To apply correspondence consistency, we have: 𝐿𝐶 = 1 𝑁 (𝑁 − 1)𝑀 ∑︁ ∑︁ ∑︁ 𝑖 𝑗, 𝑗≠𝑖 𝑚 ∥𝜋(P 𝑗 , ˆp𝑚 𝑖 ) − q𝑚 𝑖, 𝑗 ∥1, (6.26) where ˆp𝑚 𝑖 = 𝜋-1(P𝑖, p𝑚 𝑖 = 𝜋(P𝑖, ˆp𝑚). With slight abuse of notation, function 𝜋(·) returns depth for 𝐿 𝐷, and location for 𝐿𝐶. We always first render from the root frame and subsequently 𝑖 , 𝑑 (p𝑚 𝑖 )), p𝑚 project to 𝑁 frames. From there, we project to other supported frames again, forming 𝑁 (𝑁 − 1) pairs. 6.3.3 Geometric Verification With the RF optimized, we apply geometric verification to acquire sparse multi-view consistent 3D points, as in Fig. 6.7: ∑︁ C = { 𝑖 ≥ 𝑛c}, 𝑐𝑚 𝑐𝑚 𝑖 = 1 if ∑︁ ∥ ˆp𝑚 𝑖 − ˆp𝑚 ∥2 ≤ 𝜆𝑐. (6.27) 𝑖,𝑖≠𝑜 We follow the same rendering process as training, where ˆp𝑚 𝑖 𝑖,𝑖≠𝑜 is computed with Eq. (6.26). First, we render 3D points from the root frame, project them to other views, and render 3D points from there again. A point is valid if a minimum of 𝑛c views are consistent with the root. 6.4 Experiments 6.4.1 Self-supervised Depth Estimation We benchmark whether self-supervision benefits supervised depth in unseen test data. For the correspondence estimator, we use PDC-Net [251]. For depth estimators, we adopt recently pub- lished in-the-wild depth estimator, including ZoeDepth [19], ZeroDepth [153], and Metric3D [303]. We evaluate with ScanNet [52] and KITTI360 [147] where all models perform zero-shot prediction. 92 Method ZoeDepth [19] ⌜ Ours ZeroDepth [153] ⌜ Ours Metric3D [303] ⌜ Ours 𝛿1 𝛿0.5 0.658 0.894 0.793 0.942 0.351 0.589 0.490 0.725 0.533 0.753 0.664 0.838 SIlog 9.242 9.242 20.145 20.145 12.425 12.425 A.Rel 0.104 0.079 0.254 0.199 0.216 0.137 S.Rel RMS RMSlog 0.128 0.255 0.039 0.105 0.203 0.024 0.287 0.565 0.223 0.237 0.457 0.156 0.228 0.495 0.339 0.175 0.345 0.126 Table 6.2 Consistent Depth Estimation. We measure the numerical improvement by aligning the support frame depthmaps to the root frame with our depth adjustment scalars. The evaluation is conducted on support frames on ScanNet [52]. 
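Before turning to the remaining results, the geometric verification rule of Eq. (6.27) can be sketched compactly: a root-frame point is kept only if enough support views re-render a 3D point within λ_c of it. The threshold values below are placeholders, not the settings used in our experiments.

```python
import numpy as np

def geometric_verification(p_root, p_support, lambda_c=0.05, n_c=2):
    """Mask of multi-view consistent root points, following Eq. (6.27).

    p_root:    (M, 3) points rendered from the root frame.
    p_support: (N-1, M, 3) points re-rendered from each support frame.
    """
    dists = np.linalg.norm(p_support - p_root[None], axis=-1)   # (N-1, M)
    votes = (dists <= lambda_c).sum(axis=0)                     # consistent views per point
    return votes >= n_c
```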
Method PDC-Net [251] ⌜ LightedDepth [330] ⌜ Ours RoMa [69] ⌜ LightedDepth [330] ⌜ Ours Train Test M S S S PCK-1 0.119 0.061 0.178 0.144 0.066 0.183 PCK-3 0.511 0.341 0.658 0.583 0.359 0.638 PCK-5 AEPE 4.612 0.743 6.590 0.563 2.898 0.866 3.333 0.815 5.974 0.588 3.067 0.844 Table 6.3 Self-Supervised Correspondence Estimation. We improve correspondence with RGB- D inputs, using metrics from [251]. The entry train and test are training and testing datasets of correspondence estimators. [Key: M=MegaDepth, S=ScanNet] Test Data. In dense correspondence estimation, methods [331, 253, 251] output confidence score per correspondence. We follow [253, 251] to set a minimum threshold of 0.95. We run on ScanNet test split and it returns 92 sequences with sufficient correspondence. We form our test split by sampling 5 neighboring frames per valid sequence. Similarly, we run on KITTI360 data and randomly select 100 × 5 test split, i.e., 100 sequences with 5 frames each. We consider it a comprehensive experiment. Similar to SPARF [253], our triangulation trains a NeRF-like structure. For reference, SPARF experiment on DTU dataset [118] includes only 15 sequences each with 3 images. In comparison, we include around 100 sequences. Evaluation Protocols. We evaluate on root frame. We remove the scale ambiguity in the local SfM system to correctly reflect depth improvement. Specifically, we adjust all 5 depthmaps by an identical scalar computed between estimated root and GT depthmap, i.e., the median scaling [90]. This eliminates scale ambiguity in the root frame while preserving it in support frames. Results. In Tab. 6.1, our point cloud has a density of 2.6% − 9.1%, which amounts to 10 − 30k points on a 480 × 640 image. On accuracy, we have unanimous improvement over all supervised 93 Figure 6.9 Self-supervised Correspondence Estimation enabled by our method with RGB-D inputs. The correspondence error is marked by the radius of the circle. models of both datasets. Especially, we outperform strong baselines of ZoeDepth on ScanNet and Metric3D on KITTI360. 6.4.2 Consistent Depth Estimation We evaluate on ScanNet. We follow Sec. 6.4.1 data split but evaluate the support frames. Temporal consistent depth is essential for AR applications [162]. Tab. 6.2 reflects the performance gain by aligning support frames to root with adjustments, which are jointly estimated with camera poses, see Fig. 6.2 and Fig. 6.3. 6.4.3 Self-supervised Correspondence Estimation Real-world image correspondence label is expensive, e.g. KITTI provides only 200 optical flow labels. Existing datasets, such as MegaDepth and ScanNet, require large-scale 3D reconstruction with manual verification. Hence, correspondence estimators can not fine-tune on general RGB-D datasets like NYUv2 [222] or KITTI [83]. But our method enables self-supervised correspondence estimation on RGB-D data when using 3D scoring function Eq. (6.6). The camera poses are optimized with the point cloud specified by depthmap and correspondence. The accurate pose in turn improves projective correspondence. In Tab. 6.3, with 5 RGB-D frames, our method improves projective correspondence over inputs. We use the same test split as Sec. 6.4.1. The evaluation accumulates correspondence of each frame pair. Fig. 6.8a shows our improvement is unanimous over both confident and unconfident estimation. A visual example is in Fig. 6.9. 6.4.4 Sparse-view Pose Estimation Comparison with Optimization-based and Learning-based Poses. 
Previous studies either evaluate two-view pose [240, 93], or SLAM-like odometry [273]. For more comparison, following 94 Frames Method 5 COLMAP [209] Ours DeepV2D [240] - ScanNet DeepV2D [240] - NYUv2 DeepV2D [240] - KITTI LightedDepth [330] DRO [93] - ScanNet DRO [93] - KITTI DUSt3R [273] w.o. Intrinsic DUSt3R [273] w.t. Intrinsic Ours Suc. (%) 36.7 100.0 100.0 Zero-shot ✓ ✓ ✗ ✓ ✓ ✓ ✗ ✓ ✓ ✓ ✓ PCK-3 C3D-3 Rot. 0.577 0.863 0.584 0.422 0.904 0.727 0.945 0.805 0.526 1.041 0.771 0.530 4.908 0.387 0.125 0.469 0.832 0.651 0.385 0.853 0.656 3.610 0.211 0.003 0.487 0.705 0.364 0.570 0.824 0.594 0.368 0.900 0.799 Trans. 1.296 1.062 1.496 1.568 4.231 1.550 1.200 5.469 2.074 1.759 1.120 Table 6.4 Sparse-view Pose Comparison with optimization-based and learning-based methods. We only compare against COLMAP on its success sequences. Our method performs zero-shot testing on ScanNet while outperforming DeepV2D [240], DRO [93] with ScaNet [52] in training set. DUSt3R [273] trains on a similar dataset ScanNet++ [301]. Sec. 6.4.1 ScanNet split, we keep root frame and gradually add neighboring frames. In Tab. 6.4, LightedDepth [330] and ours both use PDC-Net [251] correspondence and ZoeDepth [19] mono- depth. COLMAP [209] uses PDC-Net correspondence. In evaluation, we follow [253] in aligning to GT poses. In Tab. 6.4, our zero-shot pose accuracy significantly outperforms all prior arts, including [273, 93, 240] with ScanNet [52] or ScanNet++ [301] in their training set. See Supp. for complete comparison from 3 to 9 frames. In Fig. 6.8, we attribute our superiority to certified global optimality over robust measurements. Comparison with NeRF-based Poses. Sparse view NeRF methods optimize NeRF jointly with camera poses, mandating a sophisticated and time-consuming optimization scheme. E.g., SPARF [253], takes one day to optimize the pose and NeRF. Typically, their poses are initialized with COLMAP. Our method provides an alternative initialization with superior performance. In Tab. 6.5, our initialization achieves better or on-par pose performance than SoTA [253] while only taking ∼3 minutes (Fig. 6.7). Our lower performance on Replica dataset might be due to ZoeDepth not being trained on synthetic data. Our work suggests the straightforward “first-pose-then-NeRF” scheme also applies to short videos. Certified Global Optimality . In Fig. 6.8b, our Bundle-RANSAC-Adjustment always finds more inliers than groundtruth poses. To our best knowledge, we are the first work that extends RANSAC to a multi-view system. 95 Method Frames BARF [148] RegBARF [148, 180] DistBARF [148, 11] SCNeRF [119] SPARF [253] Ours 3 Replica [231] LLFF [214] Rot. Trans. Rot. Trans. 16.96 2.04 20.87 1.52 7.73 5.59 4.12 1.93 0.53 0.76 4.09 0.46 3.35 3.66 2.36 0.65 0.15 0.52 11.6 5.0 26.5 11.4 2.8 1.9 Table 6.5 Sparse-view Pose Comparison with NeRF-based methods following [253]. Run-time. In Fig. 6.8c, we run approximately 3× slower than COLMAP. But both have quadratic complexity. With 3/5/7/9 frames, we take 0.8/2.0/5.3/9.4 minutes on RTX 2080 Ti GPU, while COLMAP uses 0.3/0.9/1.8/3.6 minutes on Intel Xeon 4216 CPU. COLMAP runs sequentially. But our method is highly parallelized. Our core operation Hougn Transform scales up with more GPUs. 6.5 Conclusion By revisiting self-supervision with local SfM, we first show self-supervised depth benefits SoTA supervised model with only 5 frames. We have SoTA sparse-view pose accuracy, applicable to NeRF rendering. 
We have diverse applications including self-supervised correspondence and consistent depth estimation. Limitation. The NeRF-like triangulation constrains our method from applying to large-scale self-supervised learning. Its efficiency requires improvement. 96 CHAPTER 7 MOTION-FROM-STRUCTURE: LEVERAGING MONOCULAR DEPTH PRIORS FOR MULTI-VIEW TASKS Structure-from-Motion (SfM) is a classical 3D vision task for recovering camera parameters and scene geometry from multi-view images. Recent advances in deep learning and vision foundation models have led to more robust monocular depth estimation (MDE) models that can directly predict structure from a single image without relying on camera motion. However, using MDE in SfM remains challenging due to its high error variance and the need for affine corrections. While prior works have incorporated MDE into SfM pipelines, it is generally used only to initialize sparse keypoints, discarding most of its dense predictions. In this paper, we introduce the notion of Motion-from-Structure (MfS), which fully leverages the density of monocular depth priors to infer camera motion. By reformulating bundle adjustment to distinguish inlier and outlier depth pixels, we eliminate the need for per-pixel adjustments and offer a plug-and-play method that integrates seamlessly with arbitrary MDE models. We show the efficacy of our approach on multi-view tasks, including pose estimation, structure-from-motion, and camera re-localization. Our method achieves state-of-the-art results on camera pose estimation, efficiently scaling to thousands of frames and highlighting the potential of MDE for multi-view tasks. 7.1 Introduction Structure-from-Motion (SfM) is a cornerstone of 3D computer vision for estimating camera intrinsics and extrinsics from image collections. Its versatility has fueled applications across diverse domains, including 3D reconstruction [81], neural rendering [170], camera re-localization [86], and robot navigation [96]. Traditional SfM methods [211] operate by jointly optimizing camera motion and 3D point positions, relying on sparse feature correspondences. However, these methods often struggle with scenes lacking sufficient texture or with large baseline motions, leading to potential degeneracy and inaccurate results. The advent of deep learning has revolutionized monocular depth estimation (MDE) [190, 19], enabling the direct inference of dense depth maps or point clouds from single images, indepen- 97 Figure 7.1 Motion-from-Structure from Monocular Depth. (Left) We directly estimate camera parameters given monocular depthmaps while jointly optimizing affine depth corrections. Unlike methods that use depthmaps for SfM initialization, our method avoids per-pixel adjustments and network fine-tuning, extending arbitrary monocular depth estimation models to multi-view tasks. (Right) We challenge whether SfM triangulation consistently improves monocular depth, partic- ularly with limited motion parallax and scene texture. We evaluate a “lower-bound” approach side-stepping SfM triangulation by relying on robust monocular networks, and found it performs surprisingly well. dently of camera motion. This rich structural prior was shown to benefit various downstream applications [51, 215]. However, leveraging MDE for multi-view tasks received less attention. 
While some recent works [26, 20, 227, 66] have explored integrating MDE into SfM pipelines, they typically use it only to initialize sparse keypoints, discarding its dense predictions and relying heavily on refinement with traditional bundle adjustment.

The performance of SfM-derived point clouds can be scene-dependent, sometimes failing to surpass the quality of monocular depth maps (Figure 7.1). This observation motivates our "Motion-from-Structure" approach, which leverages the dense structural information provided by MDE to directly recover camera motion, effectively side-stepping the triangulation step [102, 12]. This approach has several key advantages: it establishes a robust "lower bound" for pose estimation, mitigating degeneracy issues inherent in traditional SfM, and it effectively aligns individual monocular depth maps into a coherent 3D scene representation. Unlike prior methods that rely on per-pixel depth adjustments and network parameter fine-tuning, our method offers a plug-and-play solution that can seamlessly integrate with any MDE model (see Table 7.1).

Optimization Ace-Zero [26] FlowMap [227] VGGSfM [267] MASt3R-SfM [66] Ours Network Forward Network Backward Pixel-wise Depth ✓ ✓ ✓ ✓ ✓ ✓ ✗ ✗ ✓ ✗ ✗ ✓ ✗ ✗ ✗
Table 7.1 Our method relies solely on the depthmap point cloud, without adjusting network parameters or pixel-wise depth values.

A key challenge of utilizing monocular depth maps in conventional SfM lies in adapting methods optimized for sparse, accurate point clouds to leverage the dense, high-error estimates of MDE. Prior methods [267, 227, 66, 26] pre-select accurate depth pixels through neural guidance, which involves training a network to predict noise measurements. Neural guidance, while effective, still requires optimizing network parameters during bundle adjustment, leading to increased memory consumption and hindering scalability [227, 66, 186]. A second challenge in aligning independent depth maps for multi-view images is the necessity of optimizing an affine depth correction per image [20, 308].

To estimate camera parameters from dense but noisy depth maps, while jointly optimizing the required affine depth correction, we use a robust inlier-counting score inspired by RANSAC [77]. Our bundle adjustment maximizes the projective inliers between depth and correspondence maps. To address the non-differentiability and threshold sensitivity associated with inlier counting, we compute inliers across all thresholds, transforming the discrete RANSAC process into a continuous cumulative distribution function (CDF) [9]. This allows us to naturally represent the noise measurement of each depth pixel as a probability derived from the CDF and its corresponding projective residual, resulting in a smooth, differentiable, and robust optimization. Our proposed projective inlier function is flexible and compatible with robust loss functions from prior work [211], offering a plug-and-play framework that extends arbitrary monocular depth networks to large-scale multi-view 3D vision tasks.

Our main contributions are three-fold:
1. A novel bundle adjustment algorithm that efficiently handles the high noise and affine ambiguities of dense monocular depth maps. (Sec. 7.3.2)
2. An effective SfM framework that successfully leverages arbitrary MDE models for multi-view 3D vision tasks. (Sec. 7.3.3)
3. State-of-the-art performance in camera pose estimation and re-localization across multiple datasets. (Sec. 7.4)

7.2 Related Work
Foundation Models in Multiview 3D Vision.
Efforts to develop foundation models for monocular depthmap estimation [190] and binocular correspondence estimation [70, 251] have been ongoing. Pioneering studies [274, 140, 315] have unified monocular and binocular tasks within a binocular pointmap estimation framework. They demonstrate its potential for tackling multi- view 3D vision challenges, including camera extrinsic and intrinsic estimation. Their formulation nevertheless includes an optimization process to convert the dense network prediction to low DoF camera parameters. Our method benefits them with an enhanced optimization objective function specifically designed for dense and high-variance deep network outputs. Beyond that, our work encourages the community to reconsider the merits between depthmap and pointcloud network as monocular depth networks [190] show equal performance with pointcloud network [66]. RANSAC. RANdom Sample Consensus (RANSAC) algorithms [9] aim at robust low-DoF param- eter estimation in the presence of noisy data. Our work similarly handles noisy input as consuming high-variance network predictions instead of an accurate sparse point cloud. Several RANSAC works [246, 9] focus on improving the scoring function via generalizing from binary [77] to con- tinuous values. Among them, MAGSAC [9] can be considered a special case of our algorithm with an added assumption that residuals follows a truncated chi-squared distribution. Unlike [9], we leverage dense predictions, and specifically the induced residual distribution, from pre-trained monocular depth models to derive an improved scoring function. SfM with Deep Learning. There have been several pioneering works combine deep learning with SfM [267, 66, 26]. [26, 227] include the network backpropagation during SfM Bundle-Adjustment. VGGSfM [267] instead formulates SfM BA as a network forward process. However, due to higher computational complexity, both strategies either limit the network size or the scale of SfM. Our 100 Figure 7.2 Given (a) input frames, our method consumes their (b) dense correspondence with confidence scores and (c) depthmaps. method takes a different approach by decoupling network inference from SfM BA. This enables our method to benefit ongoing developments of vision foundation models for the SfM process. Finally, unlike [66], our method supports any monocular depth network beyond MASt3R. Our method highlights the potential of leveraging monocular networks for the SfM problem. Note, despite our similar name to [263], we address a different problem. 7.3 Method To leverage depth priors for multi-view tasks, we initialize depthmaps and dense correspondence maps using a monocular depth estimator (i.e. , ZoeDepth [19]) and a binocular correspondence estimator (i.e. , RoMa [70]). An example is in Fig. 7.2. We sub-sample the dense correspondence map into points to initialize two-view odometry. Our method performs hierarchical Bundle Adjust- ment, starting from a coarse stage (Sec. 7.3.3.2) to a fine stage (Sec. 7.3.3.3). Fig. 7.3 compares our algorithm to conventional SfM pipelines. 7.3.1 Overview Problem Definition. Given as input an unordered collection of 𝑁 frames {𝐼𝑖}𝑖∈[𝑁], we optimize for camera intrinsics K = {K𝑖} and extrinsics P = {P𝑖}. Using a pre-trained depth network ND and a correspondence network NC, we extract 𝑁 depthmaps D = {D𝑖 = ND(𝐼𝑖)} and pairwise correspondence maps C = {C𝑖, 𝑗 = NC(𝐼𝑖, 𝐼 𝑗 ), 𝑖 ≠ 𝑗 }. We jointly optimize per-frame affine corrections A = {𝛼𝑖, 𝑏𝑖 | 𝑖 ≤ 𝑀 }, producing aligned depth maps D′ 𝑖 = 𝛼𝑖 · D𝑖 + 𝑏𝑖. 
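For concreteness, the per-frame variables and the affine depth correction defined above can be sketched as follows; the container layout and names are illustrative assumptions, not the actual implementation.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class FrameVariables:            # X_i = (P_i, K_i, A_i)
    P: np.ndarray                # 3x4 extrinsic [R | t]
    K: np.ndarray                # 3x3 intrinsic
    alpha: float = 1.0           # affine depth scale
    b: float = 0.0               # affine depth bias

def corrected_depth(D_i: np.ndarray, X_i: FrameVariables) -> np.ndarray:
    """Apply the per-frame affine correction D'_i = alpha_i * D_i + b_i."""
    return X_i.alpha * D_i + X_i.b
```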
Optimization. Let X = {P, K, A} denote the set of all variables to optimize, and X_i = (P_i, K_i, A_i). We formulate this optimization as maximizing a scoring function S:
\[ \mathbf{X}^{*} = \arg\max_{\mathbf{X}=(\mathbf{P},\mathbf{K},\mathbf{A})} S(\mathbf{X} \mid \mathbf{D}, \mathbf{C}). \tag{7.1} \]
We define S as a summation of a suitable quality function Q over frame pairs (I_i, I_j) in the pose graph G (Sec. 7.3.3.1):
\[ S(\mathbf{X} \mid \mathbf{D}, \mathbf{C}) = \frac{1}{M}\sum_{(i,j)\in\mathcal{G}} Q\left(\mathbf{X}_i, \mathbf{X}_j \mid \mathbf{D}_i, \mathbf{D}_j, \mathbf{C}_{i,j}\right), \tag{7.2} \]
where M ≫ 1 is the number of sampled correspondences.

Algorithm 7.1 SfM Pipeline
1: Input: Image set
2: Output: Camera poses and 3D points
3: Sparse correspondence
4: Two-view geometry estimation
5: Incremental SfM:
6: - Iteratively register new images
7: - Triangulate new 3D points
8: - Bundle adjustment to refine structure and poses
9: Final Optimization: Global bundle adjustment
10: Output: Optimized poses and 3D points

Algorithm 7.2 MfS Pipeline
1: Input: Image set
2: Output: Camera poses
3: Dense depth & correspondences
4: Two-view geometry estimation
5: Coarse Stage:
6: - Optimize sub-graph log-CDF scores
7: Fine Stage:
8: - Active sampling
9: - Optimize global Euclidean CDF scores
10: Output: Optimized poses

Figure 7.3 Comparison of conventional SfM pipelines, e.g., COLMAP [211], to the proposed MfS approach.

Three-Stage SfM Pipeline. Our SfM runs in three stages: (1) initialization, (2) coarse-stage SfM, and (3) fine-stage SfM. The coarse-stage SfM focuses on robustness, roughly aligning images using randomly sampled depth and correspondence pixels. The fine-stage SfM refines camera poses, prioritizing pixels with lower reprojection errors. The following sections begin with the core of our algorithm, the Bundle-Adjustment process for inlier-outlier separation, followed by a detailed discussion of initialization, coarse-stage SfM, and fine-stage SfM.

7.3.2 Separate Inliers from Outliers in BA
Motivation. As the reader may notice, in Sec. 7.3.1 we use the term “maximizing the scoring function” instead of the more common “minimizing a loss function” found in other SfM literature [186]. This choice emphasizes our connection to RANSAC methods, as both approaches focus on optimizing low-DoF camera parameters from densely noisy inputs. Specifically, we assume dense monocular depthmaps contain sufficient inliers to support camera localization, despite being mixed with outliers. Thus, the Bundle-Adjustment is designed to maximize the number of inliers from dense depthmaps. After presenting the preliminaries, this subsection starts with a naive yet robust binary scoring function. However, the non-differentiability of the binarized function poses challenges for Bundle-Adjustment. To address this, we generalize it to a smooth form by leveraging depthmap density.

Sampling Depth and Correspondence. We only consider pairs of frames (I_i, I_j) with a co-visibility score of at least ν, defined as the percentage of pixels visible in both frames. For each co-visible frame pair, we downsample the dense full-resolution depthmaps and correspondence maps to a fixed number of pixels κ. Specifically, between frames i and j, we sample κ depth pixels on frame i and κ i-to-j correspondence pixels. We only sample correspondences with a confidence score of at least χ.

Projective Residuals. We define the residual r_{i,j,k} as the 2D discrepancy at the k-th sampled correspondence c_{i,j,k} ∈ C_{i,j}.
Denoting 𝑐𝑖, 𝑗,𝑘 as ( 𝑝𝑖, 𝑗,𝑘 , 𝑞𝑖, 𝑗,𝑘 ) ∈ 𝐼𝑖 × 𝐼 𝑗 , we write 𝑟𝑖, 𝑗,𝑘 = (cid:13) (cid:13)𝜋𝑖→ 𝑗 (cid:0)D′ 𝑖 [ 𝑝𝑖, 𝑗,𝑘 ](cid:1) − 𝑞𝑖, 𝑗,𝑘 (cid:13) (cid:13)2 , (7.3) where the operator 𝜋𝑖→ 𝑗 projects the pixel 𝑝𝑖, 𝑗,𝑘 in frame 𝐼𝑖, with its corrected depth value in D′ 𝑖, to frame 𝐼 𝑗 . The projection is defined by the camera intrinsics K𝑖, K 𝑗 and extrinsics P𝑖, P 𝑗 [99]. Other robust norms may also be used in Eq. (7.3), e.g., the Cauchy function used in [211]. Residuals to Binary Scoring Function. Given a residual threshold 𝜏, we realize Eq. (7.2) by setting Q := 𝑄b 𝜏 where 𝑄b 𝑡 (X𝑖, X 𝑗 | D𝑖, D 𝑗 , C𝑖, 𝑗 ) = (cid:205)𝑘 1[𝑟𝑖, 𝑗,𝑘 < 𝜏], (7.4) where 1(·) is the indicator function. Intuitively, a depth pixel is considered an inlier if its projective residual is below the threshold 𝜏. The binarized scoring function in Eq. (7.4) is widely used in RANSAC algorithms [77] for its superiority in managing densely noisy inputs. However, the 103 RANSAC algorithm is mostly applied to problem of low Degree-of-Freedom (DoF), e.g., essential and fundamental matrix estimation [181, 100]. In contrast, the multi-view pose estimation problem has a significantly larger solution space. This necessitates a continuous scoring function to enable first-order and second-order optimization methods. Binary Scoring Function to CDF. The dense depthmaps provide enough samples of projective residuals to utilize their distributional properties, leveraging the deep priors of the pre-trained MDE model. Letting 𝑅 denote the set of all residuals at the current epoch, we model the residual 𝑟 as a random variable following an empirical distribution R we obtain by kernel density estimation (KDE) [223, 126], i.e., 𝑟 ∼ R = KDE(𝑅), 𝑅 = {𝑟𝑖, 𝑗,𝑘 | (𝑖, 𝑗) ∈ G, 𝑘 ∈ [𝜅]} (7.5) Taking inspiration from MAGSAC [9], we smooth out the binary scoring function Eq. (7.4) with a threshold 𝜏 as: S𝜏 (X | D, C) = 1 𝑀 ∑︁ 𝑖, 𝑗,𝑘 1(𝑟𝑖, 𝑗,𝑘 < 𝜏) ≈ 1 · ∫ 𝜏 0 𝑝(𝑟) d𝑟 + 0 · ∫ +∞ 𝜏 𝑝(𝑟) d𝑟 = 𝐹 (𝜏), (7.6) where 𝑝(𝑟) and 𝐹 (𝜏) = Pr[𝑟 < 𝜏] are the probability and cumulative distribution function (CDF) of R, respectively. Beyond Binary Scoring Function. Dense depthmaps contain depth pixels with varying noise levels. Intuitively, a large threshold 𝜏 in Eq. (7.6) encourages to register camera at an approximately correct location. A small threshold 𝜏 in Eq. (7.6) improves accuracy but risks local minima. To fully leverage dense depthmaps, we extend scoring function Eq. (7.6) beyond a single threshold by integrating up to a maximum 𝜏max as: S(X | D, C) = ∫ 𝜏max 0 𝑝(𝑡) · S𝑡 (X | D, C) d𝑡. (7.7) Intuitively, Eq. (7.7) extends Eq. (7.6) by summing over infinitely many thresholds. Crucially, thresholds are sampled according to the natural residual distribution R induced by the rich depth 104 Algorithm 7.3 Forward and Backward BA Scoring Function 1: R = {𝑝, 𝐹} := KDE(𝑅) (cid:205)𝑖, 𝑗,𝑘 𝐹 (𝑟𝑖, 𝑗,𝑘 ) · 1[𝑟𝑖, 𝑗,𝑘 < 𝜏max] 2: S = 1 𝑀 𝜕 𝜕xS = 1 3: 𝑀 (cid:205)𝑖, 𝑗,𝑘 𝑝(𝑟𝑖, 𝑗,𝑘 ) · 𝜕 𝜕x 𝑟𝑖, 𝑗,𝑘 ⊲ smooth score ⊲ forward ⊲ backward priors from the pre-trained MDE model. Formally, the proposed scoring function is: S(X | D, C) = 1 𝑀 ∑︁ 𝑖, 𝑗,𝑘 𝐹 (𝑟𝑖, 𝑗,𝑘 ) · 1[𝑟𝑖, 𝑗,𝑘 < 𝜏max]. (7.8) Distinguishing Inliers from Outliers. From Eq. (7.8), the BA process naturally differentiates inliers from outliers by assigning higher values to depth pixels with smaller residuals while down- weighting those with larger residuals. Fig. 7.4 illustrates how the BA process differentiates inliers from outliers. Further, Eq. (7.8) inherits the robustness. For instance, applying Eq. 
(7.8) to update the example variable x, e.g., camera rotation component, by computing its gradient: 𝜕 𝜕x 𝐹 (𝑟𝑖, 𝑗,𝑘 ) = 𝑝𝑟 (𝑟𝑖, 𝑗,𝑘 ) · 𝜕 𝜕x 𝑟𝑖, 𝑗,𝑘 , (7.9) where gradient of extreme residual values, i.e., those with low probability, is suppressed. Finally, after optimization, the noise level of a depth pixel is represented by its residual’s probability. Algorithm 7.3 provides a succinct summary of the proposed Eq. (7.7) scoring function. Scalability. Our approach is highly parallelizable, making it suitable for large-scale SfM, thanks to its efficient data structure consisting of simple sets of depth and correspondence pairs. This is in contrast to traditional approaches requiring full 3D point clouds, which introduces complex for parallel processing, and more recent methods which run out of memory upon processing upwards of 200 views, e.g., FlowMap [227] and VGGSfM [267] as reported in [66]. 7.3.3 SfM Pipeline The subsection outlines the proposed SfM process, including initialization, coarse-stage SfM, and fine-stage SfM. 7.3.3.1 Initialization Pose Graph. We construct a weighted undirected graph G using correspondence maps C. Each edge 𝑔𝑖, 𝑗 ∈ G is defined as the visibility between frame 𝑖 and 𝑗, i.e., the percentile of pixels visible 105 in both frames. Intrinsic Initialization. In each frame, we use [274] to extract the dense pointcloud estimation. Next, the dense pointcloud is converted to an incidence field, where we apply the RANSAC intrinsic calibration method proposed in [328]. If a shared intrinsic is assumed for the input image collection, we initialize it as the median. Extrinsic and Depth Adjustments Initialization. We adopt a greedy strategy to initiate a spanning tree from the pose graph G. The root node is chosen as the one with the highest degree. A new node is added such that it maximizes the total degree of the graph after its inclusion. E.g., when frame 𝑖 is added, its extrinsic P𝑖 and depth scale adjustment 𝑠𝑖 are simultaneously initialized. The depth bias adjustment 𝑏𝑖 is initialized to 0. 7.3.3.2 Coarse-Stage SfM Logged Residual. As in Fig. 7.5, coarse stage prioritizes to register frames with an approximately correct location to avoid local minimum. We apply logarithm operation to the L2 norm residual in Eq. (7.3) to enhance robustness: 𝑖, 𝑗,𝑘 = log(1 + 𝑟𝑖, 𝑗,𝑘 ). 𝑟 l (7.10) Graph Decomposition. Suppose the frame 𝑖 is poorly registered, its corresponding residual 𝑟𝑖, 𝑗,𝑘 exhibits significantly large values. Due to the robustness property of Eq. (7.8), the residuals with larger values are automatically assigned lower weights and smaller gradients. These characteristics cause poorly registered frames to become "stuck" in a local minimum. We propose a graph decomposition strategy to mitigate the occurrence of early local minima. For the graph G, we decompose it into a 𝑁 subgraphs G𝑖: G = ∑︁ 𝑖∈𝑁 G𝑖, G𝑖 = {X𝑖, E𝑖}, (7.11) where X𝑖 = {I𝑖} ∪ N (I𝑖), and E𝑖 = {(I𝑖, I 𝑗 ) | I 𝑗 ∈ N (I𝑖)}. Each subgraph G𝑖 is a directed graph, includes the 𝑖-th frame I𝑖 and its neighbouring frames N (𝑣𝑖). Correspondingly, the Eq. (7.8) 106 Figure 7.4 Distinguishing Inliers from Outliers with Bundle-Adjustment: Distributions of pro- jective residuals before and after BA show residuals shifting towards zero, indicating the system is selecting more inliers. scoring function is formulated as: S𝑑 (P, K, A | D, C) = 1 𝑁 ∑︁ 𝑖, 𝑗,𝑘 𝜙c(𝑟 l 𝑖, 𝑗,𝑘 ), (7.12) where 𝜙c(𝑟 l 𝑖, 𝑗,𝑘 ) = 𝐹𝑟𝑖 (𝑟 l 𝑖, 𝑗,𝑘 ), and 𝑟 l 𝑖 ∼ S(R𝑖). 
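To make the forward/backward structure of Algorithm 7.3 concrete (the coarse-stage score of Eq. (7.12) applies the same computation per subgraph to the logged residuals), a minimal NumPy/SciPy sketch is given below. It treats the kernel density estimate as fixed within an update, as in Algorithm 7.3; the names are illustrative, and gaussian_kde is one possible KDE choice rather than the implementation used in this chapter.

```python
import numpy as np
from scipy.stats import gaussian_kde

def smooth_inlier_score(residuals, tau_max):
    """CDF-based scoring of Eq. (7.8) with the gradient weighting of Eq. (7.9).

    residuals: 1D array of projective residuals r_{i,j,k} at the current epoch.
    Returns the smooth score S and per-residual weights p(r) that scale
    dr/dx in the backward pass (the KDE itself is treated as constant).
    """
    kde = gaussian_kde(residuals)                      # empirical distribution R
    pdf = kde.evaluate(residuals)                      # p(r_{i,j,k})
    cdf = np.array([kde.integrate_box_1d(-np.inf, r)   # F(r) = Pr[residual < r]
                    for r in residuals])
    mask = residuals < tau_max                         # 1[r < tau_max]
    score = float(np.mean(cdf * mask))                 # S = (1/M) sum F(r) 1[r < tau_max]
    grad_weight = pdf * mask / residuals.size          # (1/M) p(r) for the chain rule
    return score, grad_weight
```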
For each logged residual 𝑟 l 𝑖, 𝑗,𝑘 from frame I𝑖, we obtain its CDF using the distribution computed only with the subgraph G𝑖. 7.3.3.3 Fine-Stage SfM From Random to Active Sampling. Our method assumes accurate 3D pointclouds from depthmap estimation for intrinsic and extrinsic calibration. However, the random sampling strat- egy in Sec. 7.3.3.1 still includes noisy depth pixels. While the robust scoring function Eq. (7.8) suppresses noisy pixels, actively sampling accurate ones could further improve performance. There- fore, in the fine-stage SfM, we prioritize depth pixels with smaller residuals. First, we accumulate pair-wise residuals as follows: 𝑓 (𝑑𝑖,𝑚) = 1 ∥N (I𝑖) ∥ ∑︁ 𝑟𝑖, 𝑗,𝑘 . 𝑗 ∈N (I𝑖) 107 (7.13) (a) Initialization (b) Coarse Stage (c) Fine Stage Figure 7.5 Hierarchical Bundle Adjustment (BA). We visualize our coarse-stage (Sec. 7.3.3.2) and fine-stage (Sec. 7.3.3.3) BA process using the 7-Scenes dataset [219] - sequence “Stairs”. With a poor initialization, the coarse-stage Bundle Adjustment registers camera poses to an approximately correct location. Then, the fine-stage optimization further improves pose accuracy. Scene courtyard delivery area electro facade kicker meadow office pipes playground relief relief 2 terrace terrains Average COLMAP [210] ACE-Zero [26] RTA RRA 1.9 56.3 1.9 34.0 7.9 53.3 64.1 92.2 16.8 87.3 0.9 0.9 0.0 36.9 1.1 30.8 2.6 17.2 17.0 16.8 5.6 11.8 2.0 100.0 4.5 100.0 9.7 49.0 RTA 60.0 28.1 48.5 90.0 86.2 0.9 32.3 28.6 18.1 16.8 11.8 100.0 99.5 47.8 RRA 4.0 27.4 16.9 74.5 26.2 3.8 0.9 9.9 3.8 16.8 7.3 5.5 15.8 16.4 FlowMap [227] VGGSfM [267] DF-SfM [105] MASt3R-SfM [66] RRA 7.5 29.4 2.5 15.7 1.5 3.8 0.9 6.6 2.6 6.9 8.4 33.2 12.3 10.1 RRA 50.5 22.0 79.9 57.5 100.0 100.0 64.9 100.0 37.3 59.6 69.9 38.7 70.4 65.4 RRA 89.8 83.1 100.0 74.3 100.0 58.1 100.0 100.0 100.0 34.2 57.4 100.0 58.2 81.2 RRA 80.7 82.5 82.8 80.9 93.5 56.2 71.1 72.5 70.5 32.9 40.9 100.0 100.0 74.2 RTA 64.4 81.8 95.5 75.3 100.0 58.1 98.5 100.0 93.6 40.2 76.1 100.0 52.5 79.7 RTA 51.2 19.6 58.6 48.7 97.8 96.2 42.1 97.8 40.8 57.9 70.3 29.6 54.9 58.9 RTA 3.6 23.8 1.2 16.8 1.5 2.9 1.5 12.1 2.8 7.7 2.8 24.1 13.8 8.8 RTA 74.8 82.0 81.2 82.6 91.0 58.1 54.5 61.5 70.1 32.9 39.1 99.6 91.9 70.7 MfS (Ours) RTA RRA 94.7 94.7 83.0 83.1 78.2 95.6 99.2 100.0 98.9 100.0 58.1 100.0 86.2 100.0 96.7 100.0 93.8 94.7 98.9 100.0 98.9 100.0 100.0 100.0 95.4 100.0 90.9 97.5 Table 7.2 Multi-view pose estimation benchmark on ETH3D dataset [213, 212] in terms of RRA (@5) and RTA (@5). (sparse-set SfM) Eq. (7.13) calculates the average residual of each depth pixel across its connected frame pairs. Next, we employ a Non-Neighborhood Suppression strategy, similar to Non-Maximum-Suppression (NMS) in detection literature. We begin by sampling depth pixels with the smallest residuals, excluding their neighbors as each is selected. In summary, the proposed active sampling utilizes depthmap density to approximate the triangulation process in classic SfM literature. We assume that depth pixels within a spatial neighborhood inherently capture its variance. Residual and Pose Graph. We change the Eq. (7.3) residual to its simple L2 norm. Meanwhile, we define the pose graph to include all images as in Eq. (7.8). Also see Fig. 7.5. 108 7.4 Experiments We demonstrate the efficacy of our method through evaluations on two fundamental 3D vision tasks: structure-from-motion (SfM) and camera re-localization. 7.4.1 Datasets We distinguish two types of SfM datasets, selecting a representative of each. 
We denote sparse-set datasets as those with minimal visual overlap between frames, selecting the ETH3D dataset [213, 212], following MASt3R-SfM [66]. We denote dense-set datasets as those with high amounts of visual overlap between frames, typically present in video sequences with hundreds to thousands of frames. Due to their scale, dense-set data poses significant challenges for traditional feature-matching approaches. Here we select the ScanNet dataset [53]. Its ground-truth odometry enables direct comparison with COLMAP [210] for both calibrated and uncalibrated camera lo- calization settings. From the 100 ScanNet test sequences, we sample at 5 FPS then select the 71 sequences where the frames do not exceed 2500, ensuring that preprocessing remains manageable and COLMAP [210] runs successfully. For camera re-localization, we use the standard 7-scenes dataset [219] following the protocol of marepo [41]. 7.4.2 Implementation Details We parameterize camera pose with a 9D rotation matrix following SPARF [253]. Across exper- iments, for the coarse-stage sub-graph CDF scoring function, we include pixels with a reprojection error smaller than exp(15) in L2-norm. In the fine-stage CDF scoring function, we include pixels with a reprojection error below 20 for ScanNet and 7-Scenes, and below 35 for ETH3D to accom- modate its high-resolution images. We use the Adam optimizer [127] for 50, 000 iterations with a learning rate of 1𝑒−4. Within each pair of frames, we sample 𝜅 = 300 pixels. The range of Neighborhood Suppression strategy in active sampling is set to N (I) = 8. We exclude an image pair if less than 𝜈 < 15% of its pixels are co-visible. During pre-processing, we sample only from the dense correspondence map where the confidence scores exceed 𝜒 > 0.3. 109 Method Depth Corres. COLMAP [210] - SuperPoint [63] Ours ZoeDepth [19] RoMa [70] RoMa [70] DUSt3R [274] UniDepth [190] RoMa [70] DUSt3R [274] MASt3R [140] UniDepth [190] MASt3R [140] GLOMAP [186] Ours SuperPoint [63] - DUSt3R [274] MASt3R [140] Calibrated Uncalibrated Acc@3◦ Acc@5◦ Acc@10◦ Acc@3◦ Acc@5◦ Acc@10◦ 0.398 0.396 0.426 0.432 0.432 0.439 0.067 0.432 0.342 0.372 0.403 0.407 0.384 0.393 0.062 0.407 0.670 0.811 0.820 0.823 0.811 0.817 0.331 0.825 0.783 0.823 0.830 0.833 0.837 0.841 0.347 0.836 0.589 0.614 0.631 0.636 0.639 0.645 0.160 0.639 0.505 0.586 0.615 0.612 0.596 0.598 0.148 0.621 Table 7.3 Structure-from-motion benchmark on the ScanNet dataset [53]. (dense-set SfM) 7.4.3 Structure-from-Motion Evaluations We evaluate SfM performance on both sparse-set and dense-set datasets, following the MASt3R- SfM evaluation protocol and metrics [66]. Sparse-Set. As shown in Table 7.2, our method achieves state-of-the-art performance on ETH3D with a significant improvement over competing baselines. Notably, our method achieves 100% in RRA on 10/13 and 95% in RTA on 9/13 scenes, with expected improvement at @3 and @1 benchmarks. Dense-Set. To highlight the plug-and-play modularity of our method, we employ a variety of depth and correspondence estimators, comparing against COLMAP [211] and GLOMAP [186]. We report results in Table 7.3, observing superior performance in all configurations. We generally obtain the best performance with UniDepth as depth estimator, and RoMa and MASt3R as cor- respondence estimators in the uncalibrated and calibrated regimes respectively. We observe that textureless ScanNet challenges GLOMAP’s global registration strategy, creating a gap to COLMAP. 
These results indicate that the rich information from monocular depth priors enable our proposed MfS approach to achieve precise pose estimation beyond the best classical approaches, even in large- scale scenarios. 7.4.4 Camera Re-Localization Evaluations Recall that camera re-localization is the task of processing a collection of mapping images with known camera poses to enable accurate pose estimation of new query images. Several approaches have been proposed for this challenging task, starting from geometric methods based on indexing 110 Category Method FM E2E SCR APR AS [207] HLoc [203] SC-wLS [288] NeuMaps [238] PixLoc [206] ACE [22] DSAC* [25] HSCNet [145] HSCNet++ [272] Direct-PN [43] DFNet [42] LENS [172] marepo [41] FoundationMDE MfS (Ours) Chess 4/1.96 2/0.79 3/0.76 2/0.81 2/0.80 1.9/0.7 1.9/1.11 2/0.7 2/0.63 10/3.52 3/1.15 3/1.3 2.1/1.24 2.2/0.77 Fire 3/1.53 2/0.87 5/1.09 3/1.11 2/0.73 1.9/0.9 1.9/1.24 2/0.9 2/0.79 27/8.66 9/3.71 10/3.7 2.3/1.39 1.9/0.80 Heads 2/1.45 2/0.92 3/1.92 2/1.17 1/0.82 0.9/0.6 1.1/1.82 1/0.9 1/0.8 17/13.1 8/6.08 7/5.8 1.8/2.03 1.1/0.80 Office 9/3.61 3/0.91 6/0.86 3/0.98 3/0.82 2.7/0.8 2.6/1.18 3/0.8 2/0.65 16/5.96 7/2.14 7/1.9 2.8/1.26 3.0/0.91 Pumpkin Kitchen 7/3.37 8/3.10 4/1.25 5/1.12 9/1.43 8/1.27 4/1.33 4/1.11 3/1.20 4/1.21 4.2/1.3 4.2/1.1 3.0/1.70 4.2/1.41 4/1.2 4/1.0 3/1.09 3/0.85 22/5.13 19/3.85 9/2.87 10/2.76 9/2.2 8/2.2 4.2/1.71 3.5/1.48 3.7/1.32 4.3/1.04 Stairs 3/2.22 6/1.62 12/2.80 4/1.12 5/1.30 3.9/1.1 4.2/1.42 3/0.8 3/0.83 32/10.6 11/5.58 14/3.6 5.6/1.67 2.7/0.78 Average 5.1/2.46 3.4/1.07 6.6/1.45 3.1/1.09 2.9/0.98 2.8/0.93 2.7/1.41 2.7/0.90 2.29/0.81 20/7.26 8/3.47 8/3.00 3.2/1.54 2.7/0.92 Table 7.4 Camera relocalization benchmark on the 7-Scenes dataset [220]. Note only centimeter precision was reported for most methods. input images into an explicit map, e.g., as a 3D point cloud, to more recent learning methods that directly encode the scene into the weights of a neural network. State-of-the-art methods can be roughly categorized into: feature matching (FM), end-to-end (E2E), scene coordinate regression (SCR), and absolute pose regression (APR). To comprehensively compare against existing meth- ods, we use the 7-scenes dataset [219] following the benchmarks reported in marepo [41] and HSCNet++ [272]. Implementation. Our method remains the same except for the initialization stage, where we adopt RoMa [70] along with DUSt3R’s two-view estimation [274], using the image with the highest similarity score retrieved by DIR [89]. This simple adaptation testifies to the robustness of our optimization strategy, leveraging deep priors for monocular depth. We note that processing the 7-Scenes dataset [219] particularly benefits from the scalability of our approach and multi-core implementation, given the sheer size of the dataset. Analysis. As summarized in Table 7.4, our approach is comparable to or surpasses state-of-the-art camera localization algorithms. Noticeably, our method exhibits superior robustness due to the adoption of the robust inlier-counting scoring function philosophy. In the challenging Stairs scene, characterized by extensive repetitive and textureless surfaces, our method successfully registers the cameras by maximizing the inlier count. This is achieved by leveraging the sub-graph scoring 111 Ablation Acc@3◦ Acc@5◦ Acc@10◦ Initialization w.o. Coarse-Stage SfM w.o. 
Ablation                  Acc@3◦  Acc@5◦  Acc@10◦
Initialization            0.125   0.359   0.678
w.o. Coarse-Stage SfM     0.351   0.582   0.810
w.o. Fine-Stage SfM       0.396   0.607   0.821
Full Scheme               0.432   0.636   0.833
DUSt3R [274] Depthmap     0.426   0.631   0.830
DUSt3R [274] Pointcloud   0.296   0.497   0.726
Table 7.5 Ablation on calibrated ScanNet [53]: UniDepth [190] (top) and DUSt3R [274] (bottom) with RoMa [70].

7.4.5 Ablation Study

We ablate the key design decisions in Tab. 7.5.

Algorithm Stages. The coarse stage focuses on registering all frames to their correct locations even under poor initialization. The fine-stage SfM further refines pose accuracy by emphasizing a small subset of reliable depth and correspondence pixels. Both stages improve performance.

Depth Format. We compare depth maps to the point clouds recently popularized by DUSt3R [274]. We posit that point-cloud estimation inherently adopts an over-parameterized pixel-wise intrinsic model, significantly reducing overall SfM performance. Our results further underscore the benefits of dense depth maps from powerful MDE models [190].
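To make the over-parameterization argument concrete, the sketch below back-projects a depth map through a single shared pinhole intrinsic matrix and compares the resulting degrees of freedom with a free-form per-pixel point cloud. This is a minimal illustration under a standard pinhole model; it is not the exact parameterization used by DUSt3R [274], and the helper name is illustrative.

    import numpy as np

    def backproject_depth(depth, K):
        # Depth-map parameterization: every pixel shares one intrinsic
        # matrix K, so its 3D point is depth(u, v) * K^-1 [u, v, 1]^T.
        h, w = depth.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))
        rays = np.linalg.inv(K) @ np.stack([u.ravel(), v.ravel(), np.ones(h * w)])
        return (rays * depth.ravel()).T.reshape(h, w, 3)

    # Degrees of freedom per image of h x w pixels:
    #   depth map + shared pinhole intrinsics: h * w + 4  (fx, fy, cx, cy)
    #   free-form point cloud (per-pixel XYZ): 3 * h * w
    # A free-form point cloud therefore implicitly allows a different ray
    # direction per pixel, i.e., a pixel-wise intrinsic model.

Under this counting, the shared-intrinsics form removes roughly two thirds of the free parameters per image, which is consistent with the gap between the Depthmap and Pointcloud rows of Tab. 7.5.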
7.5 Discussion

Large-scale dense-set SfM evaluations. Recent learning-based methods claim to surpass classical approaches, though such evaluations are typically focused on sparse-set SfM [227, 26, 66, 22]. Of the methods we compare to in our evaluations, FlowMap [227] and AceZero [26] evaluate COLMAP [211] by assessing image rendering quality after training a NeRF. However, the inherent randomness and complexity of NeRF training introduce additional factors and unknowns, making it harder to draw conclusions regarding relative performance on the fundamental SfM task. On the other hand, MASt3R-SfM [66] evaluates SfM performance on the Tanks-and-Temples dataset (T&T) [131], but only on a sub-sampled version. Moreover, since T&T uses COLMAP-generated pseudo-ground truth, such evaluations are inherently biased, as has been highlighted in several studies [213, 23]. In summary, we promote direct comparisons against state-of-the-art feature-matching methods such as COLMAP [211] on dense-set SfM with hundreds to thousands of frames, as was recently reported in [186, 270].

Scene Cubes Bears Winter Sign Inscription The Rock Tendrils Map Square Bench Statue Lawn Average marepo [41] DSAC∗ [25](Full) DSAC∗ [25](Tiny) ACE [22] MfS 75.1% 97.0% 80.7% 100% 9.3% 1.0% 28.3% 49.0% 100% 100% 51.5% 34.9% 45.1% 56.5% 58.6% 66.7% 0.0% 0.0% 85.0% 35.8% 52.2% 55.29% 83.8% 82.6% 0.2% 54.1% 100% 25.1% 56.7% 69.5% 0.0% 34.7% 50.7% 71.8% 80.7% 0.0% 37.1% 99.8% 29.3% 55.1% 70.7% 0.0% 34.2% 47.9% 68.7% 73.1% 0.3% 41.3% 99.8% 19.6% 53.3% 60.3% 0.0% 20.0% 43.6%
Table 7.6 Camera relocalization on the Wayspots [22] dataset.

Pushing the envelope on camera re-localization. To fully evaluate the efficacy of our MfS approach for camera re-localization, further evaluation on additional scenarios is needed. Note that our strong results on ETH3D suggest the approach extends to outdoor settings. Evaluation on object-centric sequences, such as CO3Dv2 [198], would be valuable, as learning-based methods typically perform well in these cases. It would be interesting to explore whether monocular depth priors alone can compensate for such specialized approaches, potentially reducing the need for per-scene adaptations, as highlighted in recent studies [41].

Implications for 3D and Vision Foundation Models. The success of our optimization-based approach for multi-view tasks leveraging monocular depth priors, as also recently demonstrated by [308], is similar in spirit to the success of detector-free SfM [105], which leverages dense feature matching to revise the traditional pipeline. These results highlight the value of dense predictions, supplementing the recent trend of utilizing point clouds following DUSt3R [274]. It would be interesting to further study this gap and explore effective trade-offs through novel network architectures.

Type Method Detector-Based Detector-Free COLMAP (SIFT+NN) SIFT + NN + PixSfM [205] D2Net + NN + PixSfM [205] R2D2 + NN + PixSfM [205] SP + SG + PixSfM [205] LoFTR + PixSfM [205] DF-SfM [106] + LoFTR DF-SfM [106] + AspanTrans. DF-SfM [106] + MatchFormer Deep-based VGG-SfM [267] FoundationMDE MfS (Ours) IMC Dataset AUC@3◦ AUC@5◦ AUC@10◦ 34.47 35.73 13.12 42.55 58.43 57.00 59.14 59.88 58.50 58.89 58.40 45.94 47.24 17.25 55.01 71.62 70.43 72.44 73.29 71.99 73.92 73.17 24.87 26.45 10.27 32.44 46.30 44.80 46.9 47.58 46.32 45.23 45.06
Table 7.7 Structure-from-Motion on the IMC2021 [122] dataset.

7.6 Conclusion

We introduced a novel “Motion-from-Structure” approach that leverages monocular depth priors, offering notable benefits for various multi-view tasks. Our method achieves state-of-the-art results on challenging datasets like ETH3D [213, 212], while also showing competitive performance on ScanNet [53] and 7-Scenes [220]. We highlight the potential of fully capitalizing on monocular depth priors to advance 3D vision, enabling more efficient and scalable solutions for complex vision tasks. By eliminating the reliance on traditional SfM initialization and improving robustness, our approach paves the way for the future integration of monocular depth estimation in large-scale 3D vision applications.

CHAPTER 8 CONCLUSIONS AND FUTURE WORK

8.1 Conclusions

We present a robust system that integrates deep monocular and binocular models with optimization techniques to improve structure and motion estimation from images. Our approach enhances accuracy across both small-scale and large-scale image collections. By utilizing a depth prior that remains independent of camera motion, our system ensures reliable performance in challenging scenarios.

In Chapter 4, we introduce a novel monocular 3D prior, the incidence field, to calibrate monocular images. This incidence field provides a pixel-wise parameterization of intrinsic properties that remain invariant to image resizing and cropping. To recover camera intrinsics, we develop a RANSAC-based algorithm that ensures robust estimation. Extensive benchmarking demonstrates the effectiveness of our method in real-world, in-the-wild scenarios. Beyond calibration, we showcase multiple downstream applications that benefit from our approach, highlighting its broader impact on 3D vision tasks.

In Chapter 2 and Chapter 3, we present advancements in depth estimation and geometric matching by addressing key challenges in self-supervised learning and pretraining strategies. First, we introduce a depth estimation framework that explicitly leverages the mutual benefits between self-supervised depth estimation and semantic segmentation.
Our approach advances the state-of-the-art, achieving performance comparable to supervised methods while significantly enhancing depth boundary accuracy. Additionally, we explore the benefits of pretraining both the encoder and decoder of a dense geometric matching network using the paired MIM task. By resolving the discrepancy between pretraining and fine-tuning, we improve geometric matching performance by reducing ambiguities in textureless regions and enhancing the representation of local planar surfaces.

In Chapter 5, we decompose two-view Structure-from-Motion (SfM) into three robust sub-tasks (normalized pose estimation, camera scale estimation, and residual depth estimation), ensuring resilience to deficient views and improving both pose estimation and video-based depth reconstruction. Building on this, Chapter 6 leverages dense depthmaps and correspondence to achieve SoTA sparse-view pose accuracy, enabling diverse applications such as self-supervised correspondence learning, consistent depth estimation, and sparse-view neural rendering. Extending this further, Chapter 7 generalizes the principles from Chapter 6, demonstrating SoTA performance across various benchmarks in indoor and outdoor scenes, camera relocalization, and Structure-from-Motion tasks. Our approach proves effective for both small- and large-scale camera pose estimation, showcasing the significant potential of monocular depth estimation in advancing 3D vision.

8.2 Future Work Suggestions

Monocular Depth Estimation. Metric-space monocular depth estimation has become an increasingly important task. Recent studies suggest that camera intrinsics play a crucial role in accurate metric-space depth estimation. Therefore, depth estimation and camera calibration should be jointly conducted, i.e., simultaneously estimating camera intrinsics and depth maps, effectively formulating a monocular SfM approach.

Correspondence Estimation. Geometric matching determines pixel-wise correspondences between two images. Recent studies have proposed various pixel-wise re-parameterizations of camera motion, highlighting the potential of geometric matching to simultaneously learn both image matching priors and camera motion priors.

Camera Calibration. Learning-based camera calibration methods still suffer from limitations due to insufficient camera-model diversity. Most datasets are collected using a single camera model, leading to a lack of diversity in available training data. One potential solution is to leverage large-scale EXIF image datasets, where focal length and camera model metadata from EXIF files provide a valuable supervision signal for fine-tuning camera models.

Structure-from-Motion. The SfM pipeline in our approach currently lacks a robust mechanism for enforcing multi-view consistency in depth triangulation. Integrating a learning-based triangulation pipeline could enable the system to benefit from both data-driven learning approaches and traditional optimization-based methods, improving overall reconstruction accuracy.

BIBLIOGRAPHY

[1] Sameer Agarwal, Yasutaka Furukawa, Noah Snavely, Ian Simon, Brian Curless, Steven M Seitz, and Richard Szeliski. Building rome in a day. Communications of the ACM, 2011. [2] Adel Ahmadyan, Liangkai Zhang, Artsiom Ablavatski, Jianing Wei, and Matthias Grundmann. Objectron: A large scale dataset of object-centric videos in the wild with pose annotations. In CVPR, 2021. [3] Cuneyt Akinlar and Cihan Topal.
Edlines: A real-time line segment detector with a false detection control. Pattern Recognition Letters, 2011. [4] Paul Alvarez. Using extended file information (exif) file headers in digital evidence analysis. IJDE, 2004. [5] Dragomir Anguelov, Carole Dulong, Daniel Filip, Christian Frueh, Stéphane Lafon, Richard Lyon, Abhijit Ogale, Luc Vincent, and Josh Weaver. Google street view: Capturing the world at street level. Computer, 2010. [6] Gwangbin Bae, Ignas Budvytis, and Roberto Cipolla. Multi-view depth estimation by fusing single-view depth probability with multi-view geometry. In CVPR, 2022. [7] Vassileios Balntas, Karel Lenc, Andrea Vedaldi, and Krystian Mikolajczyk. Hpatches: A benchmark and evaluation of handcrafted and learned local descriptors. In CVPR, 2017. [8] Hangbo Bao, Li Dong, and Furu Wei. Beit: Bert pre-training of image transformers. In ICLR, 2022. [9] Daniel Barath, Jiri Matas, and Jana Noskova. Magsac: marginalizing sample consensus. In CVPR, 2019. [10] Olga Barinova, Victor Lempitsky, Elena Tretiak, and Pushmeet Kohli. Geometric image parsing in man-made environments. In ECCV, 2010. [11] Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. In CVPR, 2022. [12] Adrien Bartoli and Peter Sturm. Structure-from-motion using lines: Representation, triangulation, and bundle adjustment. Computer Vision and Image Understanding, 100(3), 2005. [13] Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, et al. Arkitscenes–a diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data. arXiv preprint arXiv:2111.08897, 2021. [14] Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool. Speeded-up robust features (surf). CVIU, 2008. [15] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up robust features. In Computer Vision–ECCV 2006: 9th European Conference on Computer Vision, Graz, Austria, May 7-13, 2006. Proceedings, Part I 9, pages 404–417. Springer, 2006. [16] Sevinç Bayram, İsmail Avcıbaş, Bülent Sankur, and Nasir Memon. Image manipulation detection. Journal of Electronic Imaging, 2006. [17] Christian Beder and Richard Steffen. Determining an initial image pair for fixing the scale of a 3d reconstruction from an image sequence. In Joint Pattern Recognition Symposium. [18] Shariq Farooq Bhat, Ibraheem Alhashim, and Peter Wonka. Adabins: Depth estimation using adaptive bins. In CVPR, 2021. [19] Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias Müller. Zoedepth: Zero-shot transfer by combining relative and metric depth. arXiv preprint arXiv:2302.12288, 2023. [20] Wenjing Bian, Zirui Wang, Kejie Li, Jia-Wang Bian, and Victor Adrian Prisacariu. Nope-nerf: Optimising neural radiance field with no pose prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2023. [21] Eric Brachmann, Tommaso Cavallari, and Victor Adrian Prisacariu. Accelerated coordinate encoding: Learning to relocalize in minutes using rgb and poses. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5044–5053, 2023. [22] Eric Brachmann, Tommaso Cavallari, and Victor Adrian Prisacariu. Accelerated Coordinate Encoding: Learning to Relocalize in Minutes Using RGB and Poses. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[23] Eric Brachmann, Martin Humenberger, Carsten Rother, and Torsten Sattler. On the limits of pseudo ground truth in visual camera re-localisation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021. [24] Eric Brachmann, Alexander Krull, Sebastian Nowozin, Jamie Shotton, Frank Michel, Stefan Gumhold, and Carsten Rother. Dsac-differentiable ransac for camera localization. In CVPR, 2017. [25] Eric Brachmann and Carsten Rother. Visual Camera Re-Localization From RGB and RGB- D Images Using DSAC. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9), 2022. [26] Eric Brachmann, Jamie Wynn, Shuai Chen, Tommaso Cavallari, Áron Monszpart, Daniyar Turmukhambetov, and Victor Adrian Prisacariu. Scene Coordinate Reconstruction: Posing of Image Collections via Incremental Learning of a Relocalizer. In ECCV, 2024. [27] Gary Bradski and Adrian Kaehler. Opencv. Dr. Dobb’s journal of software tools, 2000. [28] Samarth Brahmbhatt, Jinwei Gu, Kihwan Kim, James Hays, and Jan Kautz. Geometry-aware learning of maps for camera localization. In CVPR, 2018. 118 [29] Garrick Brazil and Xiaoming Liu. M3D-RPN: Monocular 3D region proposal network for object detection. In Proceeding of International Conference on Computer Vision, 2019. [30] Matthew Brown, Gang Hua, and Simon Winder. Discriminative learning of local image descriptors. PAMI, 2010. [31] Julius Butime, Iñigo Gutierrez, L Galo Corzo, and C Flores Espronceda. 3d reconstruction methods, a survey. In VISAPP, 2006. [32] Daniel J Butler, Jonas Wulff, Garrett B Stanley, and Michael J Black. A naturalistic open source movie for optical flow evaluation. In ECCV, 2012. [33] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In CVPR, 2020. [34] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, pages 9650–9660, 2021. [35] Vincent Casser, Soeren Pirk, Reza Mahjourian, and Anelia Angelova. Depth prediction without the sensors: Leveraging structure for unsupervised learning from monocular videos. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pages 8001–8008, 2019. [36] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information- rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015. [37] Bo Chen, Tat-Jun Chin, and Nan Li. Bpnp: Further empowering end-to-end learning with back-propagatable geometric optimization. arXiv: 1909.06043, 2019. [38] Hongkai Chen, Zixin Luo, Lei Zhou, Yurun Tian, Mingmin Zhen, Tian Fang, David Mck- innon, Yanghai Tsin, and Long Quan. Aspanformer: Detector-free image matching with adaptive span transformer. In ECCV, 2022. [39] Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In ICML, 2020. [40] Po-Yi Chen, Alexander H Liu, Yen-Cheng Liu, and Yu-Chiang Frank Wang. Towards scene understanding: Unsupervised monocular depth estimation with semantic-aware representa- tion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2624–2632, 2019. [41] Shuai Chen, Tommaso Cavallari, Victor Adrian Prisacariu, and Eric Brachmann. 
Map- In Proceedings of the IEEE/CVF Relative Pose Regression for Visual Re-Localization. Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 119 [42] Shuai Chen, Xinghui Li, Zirui Wang, and Victor A Prisacariu. Dfnet: Enhance absolute pose regression with direct feature matching. In European Conference on Computer Vision. Springer, 2022. [43] Shuai Chen, Zirui Wang, and Victor Prisacariu. Direct-PoseNet: Absolute Pose Regression with Photometric Consistency. In 2021 International Conference on 3D Vision (3DV), 2021. [44] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple frame- work for contrastive learning of visual representations. In ICML, 2020. [45] Yuhua Chen, Cordelia Schmid, and Cristian Sminchisescu. Self-supervised learning with geometric constraints in monocular video: Connecting flow, depth, and camera. In ICCV, 2019. [46] Xinjing Cheng, Peng Wang, and Ruigang Yang. Learning depth with convolutional spatial propagation network. arXiv preprint arXiv:1810.02695, 2018. [47] Ronald Clark, Michael Bloesch, Jan Czarnowski, Stefan Leutenegger, and Andrew J Davison. Ls-net: Learning to solve nonlinear least squares for monocular stereo. ECCV, 2018. [48] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016. [49] James M Coughlan and Alan L Yuille. Manhattan world: Compass direction from a single image by bayesian inference. In ICCV, 1999. [50] David Crandall, Andrew Owens, Noah Snavely, and Dan Huttenlocher. Discrete-continuous optimization for large-scale structure from motion. In CVPR, 2011. [51] Jan Czarnowski, Tristan Laidlow, Ronald Clark, and Andrew J Davison. Deepfactors: Real- IEEE Robotics and Automation Letters, 5(2), time probabilistic dense monocular slam. 2020. [52] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In PAMI, pages 5828–5839, 2017. [53] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Niessner. ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. [54] Luanyuan Dai, Xin Liu, Jingtao Wang, Changcai Yang, and Riqing Chen. Learning two-view correspondences and geometry via local neighborhood correlation. Entropy, 2021. [55] Yuchao Dai, Zhidong Zhu, Zhibo Rao, and Bo Li. Mvs2: Deep unsupervised multi-view stereo with multi-view symmetry. In 3DV, 2019. 120 [56] Bibhash Pran Das, Mrutyunjay Biswal, Abhranta Panigrahi, and Manish Okade. Cnn based image resizing detection and resize factor classification for forensic applications. In ICORT, 2021. [57] Andrew J Davison, Ian D Reid, Nicholas D Molton, and Olivier Stasse. Monoslam: Real- time single camera slam. IEEE PAMI, 2007. [58] Marcio L Lima de Oliveira and Marco JG Bekooij. Deep-mle: Fusion between a neural network and mle for a single snapshot doa estimation. In ICASSP, 2022. [59] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009. [60] Kangle Deng, Andrew Liu, Jun-Yan Zhu, and Deva Ramanan. Depth-supervised nerf: Fewer views and faster training for free. In CVPR, 2022. [61] Patrick Denis, James H Elder, and Francisco J Estrada. 
Efficient edge-based methods for estimating manhattan frames in urban imagery. In ECCV, 2008. [62] Konstantinos G Derpanis. Overview of the ransac algorithm. Image Rochester NY, 2010. [63] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superpoint: Self-supervised interest point detection and description. In CVPRW, 2018. [64] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021. [65] Elan Dubrofsky. Homography estimation. Diplomová práce. Vancouver: Univerzita Britské Kolumbie, 2009. [66] Bardienus Duisterhof, Lojze Zust, Philippe Weinzaepfel, Vincent Leroy, Yohann Cabon, and Jerome Revaud. MASt3R-SfM: a Fully-Integrated Solution for Unconstrained Structure- from-Motion, 2024. [67] Mihai Dusmanu, Ignacio Rocco, Tomas Pajdla, Marc Pollefeys, Josef Sivic, Akihiko Torii, and Torsten Sattler. D2-net: A trainable cnn for joint detection and description of local features. In CVPR, 2019. [68] [69] [70] Johan Edstedt, Ioannis Athanasiadis, Mårten Wadenbäck, and Michael Felsberg. DKM: Dense kernelized feature matching for geometry estimation. In CVPR, 2023. Johan Edstedt, Qiyu Sun, Georg Bökman, Mårten Wadenbäck, and Michael Felsberg. Roma: Revisiting robust losses for dense feature matching. arXiv preprint arXiv:2305.15404, 2023. Johan Edstedt, Qiyu Sun, Georg Bökman, Mårten Wadenbäck, and Michael Felsberg. RoMa: Robust Dense Feature Matching. CVPR, 2024. 121 [71] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In NeurIPS, 2014. [72] Jakob Engel, Vladlen Koltun, and Daniel Cremers. Direct sparse odometry. PAMI, 2017. [73] [74] Jakob Engel, Thomas Schöps, and Daniel Cremers. Lsd-slam: Large-scale direct monocular slam. In ECCV, 2014. Jakob Engel, Jurgen Sturm, and Daniel Cremers. Semi-dense visual odometry for a monoc- ular camera. In ICCV, 2013. [75] Tuo Feng and Dongbing Gu. Sganvo: Unsupervised deep visual odometry and depth estimation with stacked generative adversarial networks. RA-L, 2019. [76] Torben Fetzer, Gerd Reis, and Didier Stricker. Stable intrinsic auto-calibration from funda- mental matrices of devices with uncorrelated camera parameters. In WACV, 2020. [77] Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 1981. [78] Sara Fridovich-Keil, Alex Yu, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance fields without neural networks. In CVPR, 2022. [79] Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. In CVPR, 2018. [80] Simon Fuhrmann, Fabian Langguth, and Michael Goesele. Mve-a multi-view reconstruction environment. In GCH, 2014. [81] Yasutaka Furukawa and Carlos Hernández. Multi-View Stereo: A Tutorial. Foundations and Trends® in Computer Graphics and Vision, 9(1-2):1–148, 2015. [82] Ravi Garg, Vijay Kumar Bg, Gustavo Carneiro, and Ian Reid. Unsupervised cnn for single view depth estimation: Geometry to the rescue. In ECCV, 2016. [83] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. IJRR, 2013. [84] Andreas Geiger, Philip Lenz, and Raquel Urtasun. 
Are we ready for autonomous driving? the kitti vision benchmark suite. In CVPR, 2012. [85] Riccardo Gherardi, Michela Farenzena, and Andrea Fusiello. Improving the efficiency of hierarchical structure-and-motion. In CVPR, 2010. [86] Ben Glocker, Shahram Izadi, Jamie Shotton, and Antonio Criminisi. Real-time RGB-D camera relocalization. In ISMAR, 2013. [87] Clément Godard, Oisin Mac Aodha, and Gabriel J Brostow. Unsupervised monocular depth estimation with left-right consistency. In CVPR, 2017. 122 [88] Clément Godard, Oisin Mac Aodha, Michael Firman, and Gabriel J Brostow. Digging into self-supervised monocular depth estimation. In ICCV, 2019. [89] Albert Gordo, Jon Almazán, Jerome Revaud, and Diane Larlus. Deep image retrieval: Learning global representations for image search. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VI 14. Springer, 2016. [90] Ariel Gordon, Hanhan Li, Rico Jonschkowski, and Anelia Angelova. Depth from videos in the wild: Unsupervised monocular depth learning from unknown cameras. In ICCV, 2019. [91] Alexander Grabner, Peter M Roth, and Vincent Lepetit. Gp2c: Geometric projection pa- In ICCV, rameter consensus for joint 3d pose and focal length estimation in the wild. 2019. [92] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Ghesh- laghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning. In NeuriPS, 2020. [93] Xiaodong Gu, Weihao Yuan, Zuozhuo Dai, Siyu Zhu, Chengzhou Tang, Zilong Dong, and Ping Tan. Dro: Deep recurrent optimizer for video to depth. IEEE Robotics and Automation Letters, 2023. [94] Xiaodong Gu, Weihao Yuan, Zuozhuo Dai, Siyu Zhu, Chengzhou Tang, and Ping Tan. Dro: Deep recurrent optimizer for structure-from-motion. arXiv preprint arXiv:2103.13201, 2021. [95] Vitor Guizilini, Rares, Ambrus, , Dian Chen, Sergey Zakharov, and Adrien Gaidon. Multi- frame self-supervised depth with transformers. In CVPR, 2022. [96] Faiza Gul, Wan Rahiman, and Syed Sahal Nazli Alhady. A comprehensive study for robot navigation techniques. Cogent Engineering, 2019. [97] Xiaoyang Guo, Hongsheng Li, Shuai Yi, Jimmy Ren, and Xiaogang Wang. Learning monocular depth by distilling cross-domain stereo networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 484–500, 2018. [98] Hyowon Ha, Sunghoon Im, Jaesik Park, Hae-Gon Jeon, and In So Kweon. High-quality depth from uncalibrated small motion clip. In CVPR, 2016. [99] Richard Hartley and Andrew Zisserman. Multiple view geometry in computer vision. Cam- bridge university press, 2003. [100] Richard I Hartley. In defense of the eight-point algorithm. PAMI, 1997. [101] Richard I. Hartley. Kruppa’s equations derived from the fundamental matrix. PAMI, 1997. [102] Richard I. Hartley and Peter Sturm. Triangulation. In Václav Hlaváč and Radim Šára, editors, Computer Analysis of Images and Patterns. Springer Berlin Heidelberg, 1995. 123 [103] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022. [104] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016. [105] Xingyi He, Jiaming Sun, Yifan Wang, Sida Peng, Qixing Huang, Hujun Bao, and Xiaowei Zhou. Detector-Free Structure from Motion. 
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024. [106] Xingyi He, Jiaming Sun, Yifan Wang, Sida Peng, Qixing Huang, Hujun Bao, and Xiaowei Zhou. Detector-free structure from motion. CVPR, 2024. [107] Heiko Hirschmuller. Accurate and efficient stereo processing by semi-global matching and mutual information. In CVPR, 2005. [108] Heiko Hirschmuller. Stereo processing by semiglobal matching and mutual information. IEEE Transactions on pattern analysis and machine intelligence, 30(2):328–341, 2007. [109] Yannick Hold-Geoffroy, Kalyan Sunkavalli, Jonathan Eisenmann, Matthew Fisher, Emiliano Gambaretto, Sunil Hadap, and Jean-François Lalonde. A perceptual measure for deep single image camera calibration. In CVPR, 2018. [110] Christian Homeyer, Oliver Lange, and Christoph Schnörr. Multi-view monocular depth and uncertainty prediction with deep sfm in dynamic environments. In ICPRAI, 2022. [111] Masa Hu, Garrick Brazil, Nanxiang Li, Liu Ren, and Xiaoming Liu. Camera self-calibration using human faces. In FG, 2023. [112] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, 2017. [113] Zhaoyang Huang, Xiaokun Pan, Runsen Xu, Yan Xu, Guofeng Zhang, Hongsheng Li, et al. Life: Lighting invariant flow estimation. arXiv preprint arXiv:2104.03097, 2021. [114] Saif Imran, Yunfei Long, Xiaoming Liu, and Daniel Morris. Depth coefficients for depth In Proceedings of the IEEE Conference on Computer Vision and Pattern completion. Recognition (CVPR), 2019. [115] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015. [116] Sergio Izquierdo and Javier Civera. Sfm-ttr: Using structure from motion for test-time refinement of single-view depth networks. In CVPR, 2023. [117] Joel Janai, Fatma Guney, Anurag Ranjan, Michael Black, and Andreas Geiger. Unsupervised In Proceedings of the European learning of multi-frame optical flow with occlusions. Conference on Computer Vision (ECCV), pages 690–706, 2018. 124 [118] Rasmus Jensen, Anders Dahl, George Vogiatzis, Engin Tola, and Henrik Aanæs. Large scale multi-view stereopsis evaluation. In CVPR, 2014. [119] Yoonwoo Jeong, Seokjun Ahn, Christopher Choy, Anima Anandkumar, Minsu Cho, and Jaesik Park. Self-calibrating neural radiance fields. In CVPR, 2021. [120] Zijie Jiang, Hajime Taira, Naoyuki Miyashita, and Masatoshi Okutomi. Self-supervised ego-motion estimation based on multi-layer fusion of rgb and inferred depth. ICRA, 2022. [121] Linyi Jin, Jianming Zhang, Yannick Hold-Geoffroy, Oliver Wang, Kevin Matzen, Matthew Sticha, and David Fouhey. Perspective fields for single image camera calibration. 2022. [122] Yuhe Jin, Dmytro Mishkin, Anastasiia Mishchuk, Jiri Matas, Pascal Fua, Kwang Moo Yi, and Eduard Trulls. Image matching across wide baselines: From paper to practice. International Journal of Computer Vision, 129(2):517–547, 2021. [123] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In ICML, 2020. [124] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7482–7491, 2018. [125] Fadi Khatib, Yuval Margalit, Meirav Galun, and Ronen Basri. 
Grelpose: Generalizable end-to-end relative camera pose regression. arXiv preprint arXiv:2211.14950, 2022. [126] JooSeuk Kim and Clayton D. Scott. Robust kernel density estimation. J. Mach. Learn. Res., 13(1):2529–2565, sep 2012. [127] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015. [128] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2015. [129] Josef Kittler. On the accuracy of the sobel edge detector. Image and Vision Computing, 1983. [130] Georg Klein and David Murray. Parallel tracking and mapping for small ar workspaces. In 2007 6th IEEE and ACM international symposium on mixed and augmented reality, 2007. [131] Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Transactions on Graphics (ToG), 36(4), 2017. [132] Jens Kober, J Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. IJRR, 2013. [133] Jana Košecká and Wei Zhang. Video compass. In ECCV, 2002. [134] Yevhen Kuznietsov, Marc Proesmans, and Luc Van Gool. Comoda: Continuous monocular depth adaptation using past experiences. In WACV, 2021. 125 [135] Yevhen Kuznietsov, Jorg Stuckler, and Bastian Leibe. Semi-supervised deep learning for In Proceedings of the IEEE Conference on Computer monocular depth map prediction. Vision and Pattern Recognition (CVPR), pages 6647–6655, 2017. [136] Hyunjoon Lee, Eli Shechtman, Jue Wang, and Seungyong Lee. Automatic upright adjustment of photographs with robust camera calibration. PAMI, 2013. [137] Jin Han Lee, Myung-Kyu Han, Dong Wook Ko, and Il Hong Suh. From big to small: Multi-scale local planar guidance for monocular depth estimation. arXiv preprint arXiv:1907.10326, 2019. [138] Jinwoo Lee, Hyunsung Go, Hyunjoon Lee, Sunghyun Cho, Minhyuk Sung, and Junho Kim. Ctrl-c: Camera calibration transformer with line-classification. In ICCV, 2021. [139] Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. Epnp: An accurate o (n) solution to the pnp problem. IJCV, 2009. [140] Vincent Leroy, Yohann Cabon, and Jerome Revaud. Grounding image matching in 3d with mast3r, 2024. [141] Haoang Li, Ji Zhao, Jean-Charles Bazin, Wen Chen, Zhe Liu, and Yun-Hui Liu. Quasi- In ICCV, globally optimal and efficient vanishing point estimation in manhattan world. 2019. [142] Hongdong Li. A practical algorithm for l triangulation with outliers. In CVPR, 2007. [143] Hongdong Li and Richard Hartley. Five-point motion estimation made easy. In ICPR, 2006. [144] Jianan Li, Xuemei Xie, Qingzhe Pan, Yuhan Cao, Zhifu Zhao, and Guangming Shi. Sgm-net: Skeleton-guided multimodal network for action recognition. PR, 2020. [145] Xiaotian Li, Shuzhe Wang, Yi Zhao, Jakob Verbeek, and Juho Kannala. Hierarchical scene coordinate classification and regression for visual localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020. [146] Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. In CVPR, 2018. [147] Yiyi Liao, Jun Xie, and Andreas Geiger. Kitti-360: A novel dataset and benchmarks for urban scene understanding in 2d and 3d. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022. [148] Chen-Hsuan Lin, Wei-Chiu Ma, Antonio Torralba, and Simon Lucey. Barf: Bundle-adjusting neural radiance fields. In ICCV, 2021. [149] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 
Microsoft coco: Common objects in context. In ECCV, 2014. [150] Tony Lindeberg. Scale invariant feature transform. 2012. 126 [151] Ce Liu, Jenny Yuen, and Antonio Torralba. Sift flow: Dense correspondence across scenes and its applications. PAMI, 2010. [152] Jinjiang Liu and Xueliang Zhang. Drc-net: Densely connected recurrent convolutional neural network for speech dereverberation. In ICASSP, 2022. [153] Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3d object. arXiv preprint arXiv:2303.11328, 2023. [154] Yaojie Liu and Xiaoming Liu. Spoof trace disentanglement for generic face anti-spoofing. PAMI, 2022. [155] Yaojie Liu, Joel Stehouwer, Amin Jourabloo, and Xiaoming Liu. Deep tree learning for zero-shot face anti-spoofing. In CVPR, 2019. [156] Xiaoxiao Long, Lingjie Liu, Wei Li, Christian Theobalt, and Wenping Wang. Multi-view depth estimation using epipolar spatio-temporal networks. In CVPR, 2021. [157] H Christopher Longuet-Higgins. A computer algorithm for reconstructing a scene from two projections. Nature, 1981. [158] Manolis IA Lourakis and Rachid Deriche. Camera self-calibration using the singular value decomposition of the fundamental matrix: From point correspondences to 3D measurements. PhD thesis, INRIA, 1999. [159] David G Lowe. Object recognition from local scale-invariant features. In ICCV, 1999. [160] David G Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004. [161] Chenxu Luo, Zhenheng Yang, Peng Wang, Yang Wang, Wei Xu, Ram Nevatia, and Alan Yuille. Every pixel counts++: Joint learning of geometry and motion with 3D holistic understanding. arXiv preprint arXiv:1810.06125, 2018. [162] Xuan Luo, Jia-Bin Huang, Richard Szeliski, Kevin Matzen, and Johannes Kopf. Consistent video depth estimation. ToG, 2020. [163] Yue Luo, Jimmy Ren, Mude Lin, Jiahao Pang, Wenxiu Sun, Hongsheng Li, and Liang Lin. Single view stereo matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 155–163, 2018. [164] Yi Ma, Stefano Soatto, Jana Košecká, and Shankar Sastry. An invitation to 3-d vision: from images to geometric models, volume 26. Springer, 2004. [165] Reza Mahjourian, Martin Wicke, and Anelia Angelova. Unsupervised learning of depth and ego-motion from monocular video using 3d geometric constraints. In CVPR, 2018. [166] Nikolaus Mayer, Eddy Ilg, Philip Hausser, Philipp Fischer, Daniel Cremers, Alexey Dosovit- skiy, and Thomas Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR, 2016. 127 [167] Ishit Mehta, Parikshit Sakurikar, and PJ Narayanan. Structured adversarial training for unsu- pervised monocular depth estimation. In Proceedings of the IEEE International Conference on 3D Vision (3DV), pages 314–323, 2018. [168] Iaroslav Melekhov, Aleksei Tiulpin, Torsten Sattler, Marc Pollefeys, Esa Rahtu, and Juho Kannala. Dgc-net: Dense geometric correspondence network. In WACV, 2019. [169] Iaroslav Melekhov, Juha Ylioinas, Juho Kannala, and Esa Rahtu. Relative camera pose estimation using convolutional neural networks. In ACIVS, 2017. [170] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoor- thi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021. [171] Ishan Misra and Laurens van der Maaten. Self-supervised learning of pretext-invariant representations. 
In CVPR, 2020. [172] Arthur Moreau, Nathan Piasco, Dzmitry Tsishkou, Bogdan Stanciulescu, and Arnaud de La Fortelle. Lens: Localization enhanced by nerf synthesis. In Proceedings of the 5th Confer- ence on Robot Learning, volume 164 of Proceedings of Machine Learning Research. PMLR, Nov 2022. [173] Arsalan Mousavian, Hamed Pirsiavash, and Jana Košecká. Joint semantic segmentation and depth estimation with deep convolutional networks. In Proceedings of the IEEE International Conference on 3D Vision (3DV), pages 611–619, 2016. [174] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos. Orb-slam: a versatile and accurate monocular slam system. IEEE transactions on robotics, 2015. [175] Raul Mur-Artal and Juan D Tardós. Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras. IEEE Transactions on Robotics, 2017. [176] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graph- ics primitives with a multiresolution hash encoding. ACM Transactions on Graphics (SIG- GRAPH), 41(4):1–15, 2022. [177] Yosuke Nakagawa, Hideaki Uchiyama, Hajime Nagahara, and Rin-Ichiro Taniguchi. Esti- mating surface normals with depth image gradients for fast and accurate registration. In 3DV, 2015. [178] Pushmeet Kohli Nathan Silberman, Derek Hoiem and Rob Fergus. Indoor segmentation and support inference from rgbd images. In ECCV, 2012. [179] Richard A Newcombe, Steven J Lovegrove, and Andrew J Davison. Dtam: Dense tracking and mapping in real-time. In ICCV, 2011. [180] Michael Niemeyer, Jonathan T Barron, Ben Mildenhall, Mehdi SM Sajjadi, Andreas Geiger, and Noha Radwan. Regnerf: Regularizing neural radiance fields for view synthesis from sparse inputs. In CVPR, 2022. 128 [181] David Nistér. An efficient solution to the five-point relative pose problem. PAMI, 2004. [182] Li Niu, Wenyan Cong, Liu Liu, Yan Hong, Bo Zhang, Jing Liang, and Liqing Zhang. Making images real again: A comprehensive survey on deep image composition. arXiv preprint arXiv:2106.14490, 2021. [183] Carl Olsson, Anders Eriksson, and Richard Hartley. Outlier removal using duality. In CVPR, 2010. [184] Yuki Ono, Eduard Trulls, Pascal Fua, and Kwang Moo Yi. Lf-net: Learning local features from images. In NeurIPS, 2018. [185] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. CoRR, 2018. [186] Linfei Pan, Daniel Barath, Marc Pollefeys, and Johannes Lutz Schönberger. Global Structure- from-Motion Revisited. In European Conference on Computer Vision (ECCV), 2024. [187] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differen- tiation in pytorch. 2017. [188] Cosimo Patruno, Roberto Marani, Grazia Cicirelli, Ettore Stella, and Tiziana D’Orazio. People re-identification using skeleton standard posture and color descriptors from rgb-d data. Pattern Recognition, 2019. [189] Luigi Piccinelli, Christos Sakaridis, and Fisher Yu. idisc: Internal discretization for monoc- ular depth estimation. In CVPR, 2023. [190] Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, and Fisher Yu. UniDepth: Universal monocular metric depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. [191] Sudeep Pillai, Rareş Ambruş, and Adrien Gaidon. Superdepth: Self-supervised, super- resolved monocular depth estimation. 
In Proceedings of the International Conference on Robotics and Automation (ICRA), pages 9250–9256, 2019. [192] Simon Placht, Peter Fürsattel, Etienne Assoumou Mengue, Hannes Hofmann, Christian Schaller, Michael Balda, and Elli Angelopoulou. Rochade: Robust checkerboard advanced detection for camera calibration. In ECCV. [193] Matteo Poggi, Fabio Tosi, and Stefano Mattoccia. Learning monocular depth estimation with unsupervised trinocular assumptions. In Proceedings of the IEEE International Conference on 3D Vision (3DV), pages 324–333, 2018. [194] Pierluigi Zama Ramirez, Matteo Poggi, Fabio Tosi, Stefano Mattoccia, and Luigi Di Stefano. Geometry meets semantics for semi-supervised monocular depth estimation. In Proceedings of the Asian Conference on Computer Vision (ACCV), pages 298–313, 2018. 129 [195] René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. PAMI. [196] Rene Ranftl, Vibhav Vineet, Qifeng Chen, and Vladlen Koltun. Dense monocular depth estimation in complex dynamic scenes. In CVPR, 2016. [197] Anurag Ranjan, Varun Jampani, Lukas Balles, Kihwan Kim, Deqing Sun, Jonas Wulff, and Michael J Black. Competitive collaboration: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. In CVPR, 2019. [198] Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021. [199] Jerome Revaud, Cesar De Souza, Martin Humenberger, and Philippe Weinzaepfel. R2d2: Reliable and repeatable detector and descriptor. In NeuriPS, 2019. [200] Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In ICCV, 2021. [201] Ignacio Rocco, Relja Arandjelović, and Josef Sivic. Efficient neighbourhood consensus networks via submanifold sparse convolutions. In ECCV, 2020. [202] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. Orb: An efficient alternative to sift or surf. In ICCV, 2011. [203] Paul-Edouard Sarlin, Cesar Cadena, Roland Siegwart, and Marcin Dymczyk. From coarse to fine: Robust hierarchical localization at large scale. In CVPR, 2019. [204] Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Super- glue: Learning feature matching with graph neural networks. In CVPR, 2020. [205] Paul-Edouard Sarlin, Philipp Lindenberger, Viktor Larsson, and Marc Pollefeys. Pixel- perfect structure-from-motion with featuremetric refinement. PAMI, 2023. [206] Paul-Edouard Sarlin, Ajaykumar Unagar, Mans Larsson, Hugo Germain, Carl Toft, Viktor Larsson, Marc Pollefeys, Vincent Lepetit, Lars Hammarstrand, Fredrik Kahl, and Torsten Sattler. Back to the Feature: Learning Robust Camera Localization From Pixels To Pose. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021. [207] Torsten Sattler, Bastian Leibe, and Leif Kobbelt. Efficient & effective prioritized matching for large-scale image-based localization. IEEE transactions on pattern analysis and machine intelligence, 39(9), 2016. 130 [208] Grant Schindler and Frank Dellaert. 
Atlanta world: An expectation maximization framework for simultaneous low-level edge grouping and camera calibration in complex man-made environments. In CVPR, 2004. [209] Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In CVPR, 2016. [210] Johannes L. Schonberger and Jan-Michael Frahm. Structure-From-Motion Revisited. In CVPR, 2016. [211] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016. [212] Thomas Schöps, Torsten Sattler, and Marc Pollefeys. BAD SLAM: Bundle Adjusted Direct RGB-D SLAM. In Conference on Computer Vision and Pattern Recognition (CVPR), 2019. [213] Thomas Schöps, Johannes L. Schönberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A Multi-View Stereo Benchmark with High-Resolution Images and Multi-Camera Videos. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017. [214] Mohammad Shafiei, Sai Bi, Zhengqin Li, Aidas Liaudanskas, Rodrigo Ortiz-Cayon, and Ravi Ramamoorthi. Learning neural transmittance for efficient rendering of reflectance fields. 2021. [215] Mengyi Shan, Brian Curless, Ira Kemelmacher-Shlizerman, and Steve Seitz. Animating street view. In SIGGRAPH Asia 2023 Conference Papers, 2023. [216] Yoli Shavit, Ron Ferens, and Yosi Keller. Coarse-to-fine multi-scene pose regression IEEE transactions on pattern analysis and machine intelligence, with transformers. 45(12):14222–14233, 2023. [217] Xi Shen, François Darmon, Alexei A Efros, and Mathieu Aubry. Ransac-flow: generic two-stage image alignment. In ECCV, 2020. [218] Xiaoming Liu Shengjie Zhu. Lighteddepth: ideo depth estimation in light of limited inference view angles. In CVPR, 2023. [219] Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2013. [220] Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene coordinate regression forests for camera relocalization in rgb-d images. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2013. [221] Chang Shu, Kun Yu, Zhixiang Duan, and Kuiyuan Yang. Feature-metric loss for self- supervised learning of depth and egomotion. In ECCV, 2020. 131 [222] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. ECCV, 2012. [223] Bernard W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman & Hall/CR, New York, 1986. [224] Gilles Simon, Antoine Fond, and Marie-Odile Berger. A-contrario horizon-first vanishing point detection using second-order grouping laws. In ECCV, 2018. [225] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015. [226] Saurabh Singh and Abhinav Shrivastava. EvalNorm: Estimating batch normalization statis- tics for evaluation. In The IEEE International Conference on Computer Vision (ICCV), pages 3633–3641, 2019. [227] Cameron Smith, David Charatan, Ayush Tewari, and Vincent Sitzmann. Flowmap: High-quality camera poses, intrinsics, and depth via gradient descent. arXiv preprint arXiv:2404.15259, 2024. [228] Noah Snavely, Steven M Seitz, and Richard Szeliski. Photo tourism: exploring photo collections in 3d. 
In Siggraph. 2006. [229] Jaime Spencer, Chris Russell, Simon Hadfield, and Richard Bowden. Kick back & relax: Learning to reconstruct the world by watching slowtv. In ICCV, 2023. [230] Hauke Strasdat, JMM Montiel, and Andrew J Davison. Real-time monocular slam: Why filter? In ICRA, 2010. [231] Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen, Erik Wijmans, Simon Green, Jakob J. Engel, Raul Mur-Artal, Carl Ren, Shobhit Verma, Anton Clarkson, Mingfei Yan, Brian Budge, Yajie Yan, Xiaqing Pan, June Yon, Yuyang Zou, Kimberly Leon, Nigel Carter, Jesus Briales, Tyler Gillingham, Elias Mueggler, Luis Pesqueira, Manolis Savva, Dhruv Batra, Hauke M. Strasdat, Renzo De Nardi, Michael Goesele, Steven Lovegrove, and Richard Newcombe. The Replica dataset: A digital replica of indoor spaces. arXiv preprint arXiv:1906.05797, 2019. [232] Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A benchmark for the evaluation of rgb-d slam systems. In IROS, 2012. [233] Peter Sturm. Multi-view geometry for general camera models. In CVPR, 2005. [234] Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. Loftr: Detector- free local feature matching with transformers. In CVPR, 2021. [235] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In CVPR, 2020. 132 [236] J Sung, H Koppula, B Selman, and A Saxena. Cornell activity datasets: Cad-60 & cad-120, 2014. [237] Chengzhou Tang and Ping Tan. BA-net: Dense bundle adjustment networks. In ICLR, 2019. [238] Shitao Tang, Sicong Tang, Andrea Tagliasacchi, Ping Tan, and Yasutaka Furukawa. NeuMap: Neural Coordinate Mapping by Auto-Transdecoder for Camera Localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2023. [239] Shitao Tang, Jiahui Zhang, Siyu Zhu, and Ping Tan. Quadtree attention for vision transform- ers. In ICLR, 2022. [240] Zachary Teed and Jia Deng. Deepv2d: Video to depth with differentiable structure from motion. In ICLR, 2020. [241] Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In ECCV, 2020. [242] Zachary Teed and Jia Deng. Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras. In NeurIPS, 2021. [243] Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. Yfcc100m: The new data in multimedia research. Communications of the ACM, 2016. [244] Sebastian Thrun. Probabilistic robotics. Communications of the ACM, 2002. [245] Lokender Tiwari, Pan Ji, Quoc-Huy Tran, Bingbing Zhuang, Saket Anand, and Manmohan In Chandraker. Pseudo rgb-d for self-improving monocular slam and depth prediction. ECCV, 2020. [246] Philip HS Torr and Andrew Zisserman. Mlesac: A new robust estimator with application to estimating image geometry. CVIU, 2000. [247] Fabio Tosi, Filippo Aleotti, Matteo Poggi, and Stefano Mattoccia. Learning monocular depth estimation infusing traditional stereo knowledge. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 9799–9809, 2019. [248] Prune Truong, Martin Danelljan, Luc V Gool, and Radu Timofte. Gocor: Bringing globally optimized correspondence volumes into your neural network. In NeuriPs, 2020. [249] Prune Truong, Martin Danelljan, and Radu Timofte. 
Glu-net: Global-local universal network for dense flow and correspondences. In CVPR, 2020. [250] Prune Truong, Martin Danelljan, Radu Timofte, and Luc Van Gool. Pdc-net+: Enhanced probabilistic dense correspondence network. arXiv preprint arXiv:2109.13912, 2021. [251] Prune Truong, Martin Danelljan, Radu Timofte, and Luc Van Gool. Pdc-net+: Enhanced probabilistic dense correspondence network. arXiv preprint arXiv:2109.13912, 2023. 133 [252] Prune Truong, Martin Danelljan, Luc Van Gool, and Radu Timofte. Learning accurate dense correspondences and when to trust them. In CVPR, 2021. [253] Prune Truong, Marie-Julie Rakotosaona, Fabian Manhardt, and Federico Tombari. Sparf: Neural radiance fields from sparse and noisy poses. In CVPR, 2023. [254] Tinne Tuytelaars, Krystian Mikolajczyk, et al. Local invariant feature detectors: a survey. Foundations and trends® in computer graphics and vision, 2008. [255] Michał Tyszkiewicz, Pascal Fua, and Eduard Trulls. Disk: Learning local features with policy gradient. In NeuriPS, 2020. [256] Tluuldcus Ucicr. Feature-based image metamorphosis. Computer graphics, 26:2, 1992. [257] Jonas Uhrig, Nick Schneider, Lukas Schneider, Uwe Franke, Thomas Brox, and Andreas Geiger. Sparsity invariant cnns. In 3DV, 2017. [258] Benjamin Ummenhofer, Huizhong Zhou, Jonas Uhrig, Nikolaus Mayer, Eddy Ilg, Alexey Dosovitskiy, and Thomas Brox. Demon: Depth and motion network for learning monocular stereo. In CVPR, 2017. [259] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017. [260] Sudheendra Vijayanarasimhan, Susanna Ricco, Cordelia Schmid, Rahul Sukthankar, and Katerina Fragkiadaki. Sfm-net: Learning of structure and motion from video. arXiv preprint arXiv:1704.07804, 2017. [261] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, Pierre-Antoine Manzagol, and Léon Bottou. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. JMLR, 2010. [262] Rafael Grompone Von Gioi, Jeremie Jakubowicz, Jean-Michel Morel, and Gregory Randall. Lsd: A fast line segment detector with a false detection control. PAMI, 2008. [263] Jayakorn Vongkulbhisal, Ricardo Cabral, Fernando De la Torre, and João P Costeira. Motion from structure (mfs): Searching for 3d objects in cluttered point trajectories. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016. [264] Brandon Wagstaff, Valentin Peretroukhin, and Jonathan Kelly. Self-supervised structure- arXiv preprint from-motion through tightly-coupled depth and egomotion networks. arXiv:2106.04007, 2021. [265] Fangjinhua Wang, Silvano Galliani, Christoph Vogel, and Marc Pollefeys. Itermvs: Iterative probability estimation for efficient multi-view stereo. 2022. [266] Fangjinhua Wang, Silvano Galliani, Christoph Vogel, Pablo Speciale, and Marc Pollefeys. Patchmatchnet: Learned multi-view patchmatch stereo. In CVPR, 2021. 134 [267] Jianyuan Wang, Nikita Karaev, Christian Rupprecht, and David Novotny. Vggsfm: Visual geometry grounded deep structure from motion. In CVPR, 2024. [268] Jianyuan Wang, Yiran Zhong, Yuchao Dai, Stan Birchfield, Kaihao Zhang, Nikolai Smolyan- skiy, and Hongdong Li. Deep two-view structure-from-motion revisited. CVPR, 2021. [269] Peng Wang, Xiaohui Shen, Zhe Lin, Scott Cohen, Brian Price, and Alan L Yuille. 
Towards unified depth and semantic prediction from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2800–2809, 2015. [270] Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. arXiv preprint arXiv:2501.12387, 2025. [271] Qing Wang, Jiaming Zhang, Kailun Yang, Kunyu Peng, and Rainer Stiefelhagen. Match- former: Interleaving attention in transformers for feature matching. In ACCV, 2022. [272] Shuzhe Wang, Zakaria Laskar, Iaroslav Melekhov, Xiaotian Li, Yi Zhao, Giorgos Tolias, and Juho Kannala. Hscnet++: Hierarchical scene coordinate classification and regression for visual localization with transformer. International Journal of Computer Vision, 132(7), Jul 2024. [273] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. arXiv preprint arXiv:2312.14132, 2023. [274] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024. [275] Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu, Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Sebastian Scherer. Tartanair: A dataset to push the limits of visual slam. In IROS, 2020. [276] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. assessment: from error visibility to structural similarity. TIP, 2004. Image quality [277] Zirui Wang, Shangzhe Wu, Weidi Xie, Min Chen, and Victor Adrian Prisacariu. Nerf–: Neural radiance fields without known camera parameters. arXiv preprint arXiv:2102.07064, 2021. [278] Jamie Watson, Michael Firman, Gabriel J Brostow, and Daniyar Turmukhambetov. Self- supervised monocular depth hints. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2162–2171, 2019. [279] Jamie Watson, Oisin Mac Aodha, Victor Prisacariu, Gabriel Brostow, and Michael Firman. The temporal opportunist: Self-supervised multi-frame monocular depth. In CVPR, 2021. [280] Xingkui Wei, Yinda Zhang, Zhuwen Li, Yanwei Fu, and Xiangyang Xue. Deepsfm: Structure from motion via deep bundle adjustment. In ECCV, 2020. 135 [281] Horst Wildenauer and Allan Hanbury. Robust camera self-calibration from monocular images of manhattan worlds. In CVPR, 2012. [282] Olivia Wiles, Sebastien Ehrhardt, and Andrew Zisserman. Co-attention for conditioned image matching. In CVPR, 2021. [283] Scott Workman, Connor Greenwell, Menghua Zhai, Ryan Baltenberger, and Nathan Jacobs. Deepfocal: A method for direct focal length estimation. In ICIP, 2015. [284] Scott Workman, Menghua Zhai, and Nathan Jacobs. Horizon lines in the wil. In BMVC, 2016. [285] Stephen Wright, Jorge Nocedal, et al. Numerical optimization. Springer Science, 1999. [286] Changchang Wu. Towards linear-time incremental structure from motion. In 3DV, 2013. [287] Changchang Wu et al. Visualsfm: A visual structure from motion system. 2011. [288] Xin Wu, Hao Zhao, Shunkai Li, Yingdian Cao, and Hongbin Zha. Sc-wls: Towards interpretable feed-forward camera re-localization. In European Conference on Computer Vision. Springer, 2022. [289] Jianxiong Xiao, Andrew Owens, and Antonio Torralba. Sun3d: A database of big spaces reconstructed using sfm and object labels. In ICCV, 2013. [290] Tete Xiao, Mannat Singh, Eric Mintun, Trevor Darrell, Piotr Dollár, and Ross Girshick. 
[290] Tete Xiao, Mannat Singh, Eric Mintun, Trevor Darrell, Piotr Dollár, and Ross Girshick. Early convolutions help transformers see better. In NeurIPS, 2021.
[291] Yuxi Xiao, Li Li, Xiaodi Li, and Jian Yao. Deepmle: A robust deep maximum likelihood estimator for two-view structure from motion. In IROS, 2022.
[292] Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin Bao, Zhuliang Yao, Qi Dai, and Han Hu. Simmim: A simple framework for masked image modeling. In CVPR, 2022.
[293] Yiliang Xu, Sangmin Oh, and Anthony Hoogs. A minimum error vanishing point detection approach for uncalibrated monocular images of man-made environments. In CVPR, 2013.
[294] Ceyuan Yang, Zhirong Wu, Bolei Zhou, and Stephen Lin. Instance localization for self-supervised detection pretraining. In CVPR, 2021.
[295] Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. In CVPR, 2024.
[296] Nan Yang, Rui Wang, Jorg Stuckler, and Daniel Cremers. Deep virtual stereo odometry: Leveraging deep depth prediction for monocular direct sparse odometry. In ECCV, 2018.
[297] Zhenheng Yang, Peng Wang, Wei Xu, Liang Zhao, and Ramakant Nevatia. Unsupervised learning of geometry with edge-aware depth-normal consistency. arXiv preprint arXiv:1711.03665, 2017.
[298] Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. Mvsnet: Depth inference for unstructured multi-view stereo. In ECCV, 2018.
[299] Yao Yao, Zixin Luo, Shiwei Li, Tianwei Shen, Tian Fang, and Long Quan. Recurrent mvsnet for high-resolution multi-view stereo depth inference. In CVPR, 2019.
[300] Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren, Lei Zhou, Tian Fang, and Long Quan. Blendedmvs: A large-scale dataset for generalized multi-view stereo networks. In CVPR, 2020.
[301] Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d indoor scenes. In ICCV, 2023.
[302] Kwang Moo Yi, Eduard Trulls, Vincent Lepetit, and Pascal Fua. Lift: Learned invariant feature transform. In ECCV, 2016.
[303] Wei Yin, Chi Zhang, Hao Chen, Zhipeng Cai, Gang Yu, Kaixuan Wang, Xiaozhi Chen, and Chunhua Shen. Metric3d: Towards zero-shot metric 3d prediction from a single image. In ICCV, 2023.
[304] Zhichao Yin, Trevor Darrell, and Fisher Yu. Hierarchical discrete distribution decomposition for match density estimation. In CVPR, 2019.
[305] Zhichao Yin and Jianping Shi. Geonet: Unsupervised learning of dense depth, optical flow and camera pose. In CVPR, 2018.
[306] Qichao Ying, Hang Zhou, Zhenxing Qian, Sheng Li, and Xinpeng Zhang. Robust image protection countering cropping manipulation. arXiv preprint arXiv:2206.02405, 2022.
[307] Xianggang Yu, Mutian Xu, Yidan Zhang, Haolin Liu, Chongjie Ye, Yushuang Wu, Zizheng Yan, Chenming Zhu, Zhangyang Xiong, Tianyou Liang, et al. Mvimgnet: A large-scale dataset of multi-view images. In CVPR, 2023.
[308] Yifan Yu, Shaohui Liu, Rémi Pautrat, Marc Pollefeys, and Viktor Larsson. Relative pose estimation through affine corrections of monocular depth priors. arXiv preprint arXiv:2501.05446, 2025.
[309] Weihao Yuan, Xiaodong Gu, Zuozhuo Dai, Siyu Zhu, and Ping Tan. New crfs: Neural window fully-connected crfs for monocular depth estimation. In CVPR, 2022.
[310] Ekim Yurtsever, Jacob Lambert, Alexander Carballo, and Kazuya Takeda. A survey of autonomous driving: Common practices and emerging technologies. IEEE Access, 2020.
[311] Menghua Zhai, Scott Workman, and Nathan Jacobs. Detecting vanishing points using global image context in a non-manhattan world. In CVPR, 2016.
[312] Huangying Zhan, Ravi Garg, Chamara Saroj Weerasekera, Kejie Li, Harsh Agarwal, and Ian Reid. Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction. In CVPR, 2018.
[313] Hui Zhang, Kwan-Yee K. Wong, and Guoqiang Zhang. Camera calibration from images of spheres. PAMI, 2007.
[314] Jiahui Zhang, Dawei Sun, Zixin Luo, Anbang Yao, Lei Zhou, Tianwei Shen, Yurong Chen, Long Quan, and Hongen Liao. Learning two-view correspondences and geometry using order-aware network. In ICCV, 2019.
[315] Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. Monst3r: A simple approach for estimating geometry in the presence of motion. arXiv preprint arXiv:2410.03825, 2024.
[316] Yueqiang Zhang, Langming Zhou, Haibo Liu, and Yang Shang. A flexible online camera calibration using line segments. Journal of Sensors.
[317] Zhengyou Zhang. A flexible new technique for camera calibration. PAMI, 2000.
[318] Zhengyou Zhang. Camera calibration with one-dimensional objects. PAMI, 2004.
[319] Zhengyou Zhang. Microsoft kinect sensor and its effect. IEEE Multimedia, 19(2), 2012.
[320] Zhenyu Zhang, Zhen Cui, Chunyan Xu, Yan Yan, Nicu Sebe, and Jian Yang. Pattern-affinitive propagation across depth, surface normal and semantic segmentation. In CVPR, 2019.
[321] Zhoutong Zhang, Forrester Cole, Richard Tucker, William T Freeman, and Tali Dekel. Consistent depth of moving objects in video. TOG, 2021.
[322] Wang Zhao, Shaohui Liu, Yezhi Shu, and Yong-Jin Liu. Towards better generalization: Joint depth-pose learning without posenet. In CVPR, 2020.
[323] Huizhong Zhou, Benjamin Ummenhofer, and Thomas Brox. Deeptam: Deep tracking and mapping. In ECCV, 2018.
[324] Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. ibot: Image bert pre-training with online tokenizer. In ICLR, 2022.
[325] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G Lowe. Unsupervised learning of depth and ego-motion from video. In CVPR, 2017.
[326] Zhengming Zhou and Qiulei Dong. Two-in-one depth: Bridging the gap between monocular and binocular self-supervised depth estimation. In ICCV, 2023.
[327] Shengjie Zhu, Garrick Brazil, and Xiaoming Liu. The edge of depth: Explicit constraints between segmentation and depth. In CVPR, 2020.
[328] Shengjie Zhu, Abhinav Kumar, Masa Hu, and Xiaoming Liu. Tame a wild camera: In-the-wild monocular camera calibration. In NeurIPS, 2023.
[329] Shengjie Zhu, Abhinav Kumar, Masa Hu, and Xiaoming Liu. Tame a wild camera: In-the-wild monocular camera calibration. Advances in Neural Information Processing Systems, 36, 2024.
[330] Shengjie Zhu and Xiaoming Liu. Lighteddepth: Video depth estimation in light of limited inference view angles. In CVPR, 2023.
[331] Shengjie Zhu and Xiaoming Liu. Pmatch: Paired masked image modeling for dense geometric matching. In CVPR, 2023.
[332] Shengjie Zhu and Xiaoming Liu. Revisit self-supervised depth estimation with local structure-from-motion. In ECCV, 2024.
[333] Yi Zhu, Karan Sapra, Fitsum A Reda, Kevin J Shih, Shawn Newsam, Andrew Tao, and Bryan Catanzaro. Improving semantic segmentation via video propagation and label relaxation. In CVPR, 2019.
[334] Yuliang Zou, Pan Ji, Quoc-Huy Tran, Jia-Bin Huang, and Manmohan Chandraker. Learning monocular visual odometry via self-supervised long-term modeling. In ECCV, 2020.
APPENDIX
A chronological list of all peer-reviewed and under-review publications completed during the Ph.D. program at MSU is provided below.
• Shengjie Zhu, Ahmed Abdelkader, Mark J. Matthews, Xiaoming Liu, and Wen-Sheng Chu. "Motion-from-Structure: Leveraging Monocular Depth Priors for Multi-View Tasks." Manuscript under review at the International Conference on Computer Vision. 2025.
• Shengjie Zhu and Xiaoming Liu. "Revisit Self-supervised Depth Estimation with Local Structure-from-Motion." Proceedings of the European Conference on Computer Vision. 2024.
• Shengjie Zhu∗, Girish Chandar Ganesan∗, Abhinav Kumar, and Xiaoming Liu. "Remove Projective LiDAR Depthmap Artifacts via Exploiting Epipolar Geometry." Proceedings of the European Conference on Computer Vision. 2024.
• Shengjie Zhu, Abhinav Kumar, Masa Hu, and Xiaoming Liu. "Tame a Wild Camera: In-the-Wild Monocular Camera Calibration." Proceedings of the Advances in Neural Information Processing Systems. 2023.
• Shengjie Zhu and Xiaoming Liu. "PMatch: Paired Masked Image Modeling for Dense Geometric Matching." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
• Shengjie Zhu and Xiaoming Liu. "LightedDepth: Video Depth Estimation in Light of Limited Inference View Angles." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
• Shengjie Zhu, Garrick Brazil, and Xiaoming Liu. "The Edge of Depth: Explicit Constraints between Segmentation and Depth." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.