
:wave: Welcome to the 3rd Monocular Depth Estimation Challenge Workshop, organized at CVPR 2024 :wave:


Monocular depth estimation (MDE) is an important low-level vision task, with applications in fields such as augmented reality, robotics and autonomous vehicles. Recently, there has been increased interest in self-supervised systems capable of predicting the 3D scene structure without requiring ground-truth LiDAR training data. Automotive data has accelerated the development of these systems, thanks to the vast quantities of data, the ubiquity of stereo camera rigs and the mostly-static world. However, the evaluation process has also remained focused on the automotive domain and has been largely unchanged since its inception, relying on simple metrics and sparse LiDAR data.

This workshop seeks to answer the following questions:

  1. How well do networks generalize beyond their training distribution relative to humans?
  2. What metrics provide the most insight into the model’s performance? What is the relative weight of simple cues, e.g. height in the image, in networks and humans?
  3. How do the predictions made by the models differ from how humans perceive depth? Are the failure modes the same?

The workshop will therefore consist of two parts: invited keynote talks discussing current developments in MDE and a challenge organized around a novel benchmarking procedure using the SYNS dataset.

:page_facing_up: Paper

paper

:tv: Videos

Introduction

Keynote #1 – Matteo Poggi

Keynote #2 – Vitor Guizilini

The MDEC Challenge

Challenge Participant #1 – Mykola Lavreniuk

Challenge Participant #2 – Guangyuan Zhou

Challenge Participant #3 – Aradhye Agarwal

Challenge Participant #4 – James H. Elder

Keynote #3 – Eric Brachmann

:newspaper: News


:hourglass_flowing_sand: Important Dates


:calendar: Schedule

The workshop will take place on 18 Jun 2024 from 01:30PM – 05:30PM PDT.

NOTE: Times are shown in Pacific Daylight Time. Please take this into account if joining the workshop virtually.

Time (PDT) Event
13:30 - 13:45 Introduction
13:45 - 14:30 Matteo Poggi *
14:30 - 15:15 Vitor Guizilini
15:15 - 15:30 Break
15:30 - 15:50 Matteo Poggi – The Monocular Depth Estimation Challenge
15:50 - 16:00 Mykola Lavreniuk – EVP++ (Challenge Participant)
16:00 - 16:10 Guangyuan Zhou – PICO-MR (Challenge Winner!)
16:10 - 16:20 Aradhye Agarwal – visioniitd (Challenge Participant)
16:20 - 16:30 James H. Elder – Elder Lab (Challenge Participant)
16:30 - 17:15 Eric Brachmann
17:15 - 17:30 Closing Notes

* Fatma Güney was unable to attend the workshop; Matteo Poggi gave the first keynote in her place.


:microphone: Keynote Speakers

Fatma Güney
Assistant Professor
Koç University
Vitor Guizilini
Staff Research Scientist
Toyota Research Institute
Eric Brachmann
Staff Scientist
Niantic

Fatma Güney is an Assistant Professor at Koç University in Istanbul. She received her PhD from the Max Planck Institute in Germany. Her research focuses on computer vision problems related to autonomous driving. In the last few years, she has published papers on monocular depth estimation, unsupervised object segmentation, and future prediction in different representations. She is a recipient of the ERC Starting Grant as well as prestigious fellowships including the Newton Fund Advanced Fellowship and the Marie Curie Individual Fellowship. She regularly serves as a reviewer, has received multiple outstanding reviewer awards, and more recently has served as an Area Chair at top-tier computer vision conferences.

Vitor Guizilini is currently a staff research scientist at the Toyota Research Institute (TRI) in Los Altos, California. His interests include self-supervised learning, monocular depth estimation, unsupervised domain adaptation, implicit visual representations, camera calibration, and vision-based machine learning in general. He has close to 100 publications in related areas, in addition to a large number of patents in different stages of filing and application to various products. He created and currently maintains the packnet-sfm and vidar repositories, two of the most widely used depth estimation codebases in the scientific community. He co-organized the first three editions of the “Frontiers of Monocular 3D Perception” workshop and has also given several talks at top-tier conferences.

Eric Brachmann is a staff scientist at Niantic, working on the Lightship Visual Positioning System (VPS). He works at the intersection of machine learning and computer vision, 3D vision in particular. His research revolves around topics such as visual relocalisation, pose estimation, end-to-end learning, robust optimization and feature matching. He publishes his research in the top conferences in computer vision where he is also an active reviewer with several outstanding reviewer mentions. He has co-organized several tutorials and workshops on visual relocalisation and object pose estimation.


:trophy: Challenge Winners

Congratulations to the challenge winners – PICO-MR!

| Team | Train | F | F (Edges) | MAE | RMSE | AbsRel | Acc (Edges) | Comp (Edges) |
|---|---|---|---|---|---|---|---|---|
| PICO-MR | †D* | 23.72 | 11.01 | 3.78 | 6.61 | 21.24 | 3.90 | 4.45 |
| Anonymous | ? | 23.25 | 10.78 | 3.87 | 6.70 | 21.70 | 3.59 | 9.86 |
| RGA-Robot | †S | 22.79 | 11.52 | 5.21 | 9.23 | 28.86 | 4.15 | 0.90 |
| EVP++ | †D | 20.87 | 10.92 | 3.71 | 6.53 | 19.02 | 2.88 | 6.77 |
| Anonymous | ? | 20.77 | 9.96 | 4.33 | 7.83 | 27.80 | 3.45 | 13.25 |
| 3DCreators | †D | 20.42 | 10.19 | 4.41 | 7.89 | 23.94 | 3.61 | 5.80 |
| visioniitd | D | 19.07 | 9.92 | 4.53 | 7.96 | 23.27 | 3.26 | 8.00 |
| Anonymous | ? | 18.60 | 9.43 | 3.92 | 7.16 | 20.12 | 2.89 | 15.65 |
| HIT-AIIA | †D | 17.83 | 9.14 | 4.11 | 7.73 | 21.23 | 2.95 | 17.81 |
| FRDC-SH | †D | 17.81 | 9.75 | 5.04 | 8.92 | 24.01 | 3.16 | 14.16 |
| Anonymous | ? | 17.57 | 9.13 | 4.28 | 8.36 | 23.35 | 3.18 | 20.66 |
| Anonymous | ? | 16.91 | 9.07 | 4.14 | 7.35 | 22.05 | 3.24 | 18.52 |
| Anonymous | ? | 16.71 | 9.25 | 5.48 | 11.05 | 34.20 | 2.57 | 18.04 |
| Anonymous | ? | 16.45 | 8.89 | 5.29 | 10.53 | 33.67 | 2.60 | 18.73 |
| hyc123 | D | 15.92 | 9.17 | 8.25 | 13.88 | 43.88 | 4.11 | 0.74 |
| ReadingLS | †MD* | 14.81 | 8.14 | 5.01 | 8.94 | 29.39 | 3.28 | 30.28 |
| Baseline | S | 13.72 | 7.76 | 5.56 | 9.72 | 32.04 | 3.97 | 21.63 |
| Anonymous | ? | 13.71 | 7.55 | 5.49 | 9.44 | 30.74 | 3.61 | 18.36 |
| Anonymous | ? | 11.90 | 8.08 | 6.33 | 10.89 | 30.46 | 2.99 | 33.63 |
| Elder Lab | D | 11.04 | 7.09 | 8.76 | 15.86 | 63.32 | 3.22 | 40.61 |

M: Monocular – S: Stereo – D: Ground-truth Depth – D*: Proxy Depth – †: Pre-trained Depth Anything

Teams


:checkered_flag: Challenge

Teams submitting to the challenge will also be required to submit a description of their method. As part of the CVPR Workshop Proceedings, we will publish a paper summarizing the results of the challenge, including a description of each method. All challenge participants surpassing the performance of the Garg baseline (by jspenmar) will be added as authors of this paper. Top performers will additionally be invited to present their method at the workshop. This presentation can be given either in person or virtually.

IMPORTANT: We have decided to expand this edition of the challenge beyond self-supervised models. This means we are accepting any monocular method, e.g. supervised, weakly-supervised, multi-task… The only restriction is that the model cannot be trained on any portion of the SYNS(-Patches) dataset and must make the final depth map prediction using only a single image.

[GitHub] — [Challenge]

The challenge focuses on evaluating novel MDE techniques on the SYNS-Patches dataset proposed in this benchmark. This dataset provides a challenging variety of urban and natural scenes, including forests, agricultural settings, residential streets, industrial estates, lecture theatres, offices and more. Furthermore, the high-quality dense ground-truth LiDAR allows for the computation of more informative evaluation metrics, such as those focused on depth discontinuities.


The challenge is hosted on CodaLab. We have provided a GitHub repository containing training and evaluation code for multiple recent SotA approaches to MDE. These will serve as a competitive baseline for the challenge and as a starting point for participants. The challenge leaderboards use the withheld validation and test sets for SYNS-Patches. We additionally encourage evaluation on the public KITTI Eigen-Benchmark dataset.

Submissions will be evaluated on a variety of metrics:

  1. Pointcloud reconstruction: F-Score
  2. Image-based depth: MAE, RMSE, AbsRel
  3. Depth discontinuities: F-Score, Accuracy, Completeness
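As a rough illustration of the image-based depth metrics in item 2, the sketch below computes MAE, RMSE and AbsRel with NumPy. The function name `image_depth_metrics`, the `min_depth` cutoff and the convention that missing ground truth is stored as 0 are assumptions made for this example; the challenge's GitHub repository remains the reference for the actual evaluation code.

```python
import numpy as np


def image_depth_metrics(pred, target, min_depth=1e-3):
    """Return MAE, RMSE and AbsRel over pixels with valid ground truth.

    Hypothetical helper for illustration, not the official evaluation code.
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)

    # Assumption: pixels without a ground-truth measurement are stored
    # as 0 (or NaN) and must be excluded from every metric.
    valid = np.isfinite(target) & (target > min_depth)
    if not valid.any():
        raise ValueError("no valid ground-truth pixels")

    err = pred[valid] - target[valid]
    return {
        "MAE": float(np.mean(np.abs(err))),                     # mean absolute error
        "RMSE": float(np.sqrt(np.mean(err ** 2))),              # root mean squared error
        "AbsRel": float(np.mean(np.abs(err) / target[valid])),  # error relative to true depth
    }


# Toy 2x2 depth maps; 0.0 marks a pixel with no ground truth.
pred = np.array([[2.0, 4.0], [5.0, 9.0]])
target = np.array([[2.0, 5.0], [4.0, 0.0]])
metrics = image_depth_metrics(pred, target)
print(metrics)
```

AbsRel divides each error by the true depth, so the same absolute error counts for more on nearby surfaces than on distant ones, whereas MAE and RMSE weight all valid pixels by absolute error alone.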

Challenge winners will be determined based on the pointcloud-based F-Score performance.
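For readers unfamiliar with the pointcloud F-Score, a minimal sketch follows. It assumes the predicted and ground-truth depth maps have already been back-projected into 3D points, and it uses a brute-force nearest-neighbour search. The function name `pointcloud_f_score` and the `threshold` value are hypothetical choices for this example and may differ from the settings used on the leaderboard.

```python
import numpy as np


def pointcloud_f_score(pred_pts, gt_pts, threshold=0.1):
    """Harmonic mean of precision and recall between two (N, 3) point sets.

    Hypothetical helper for illustration; the threshold is an assumed value.
    """
    pred_pts = np.asarray(pred_pts, dtype=np.float64)
    gt_pts = np.asarray(gt_pts, dtype=np.float64)

    # Pairwise Euclidean distances, shape (num_pred, num_gt).
    # Brute force: only suitable for small clouds.
    dists = np.linalg.norm(pred_pts[:, None, :] - gt_pts[None, :, :], axis=-1)

    # Precision: fraction of predicted points close to some ground-truth point.
    precision = np.mean(dists.min(axis=1) < threshold)
    # Recall: fraction of ground-truth points close to some predicted point.
    recall = np.mean(dists.min(axis=0) < threshold)

    if precision + recall == 0:
        return 0.0
    return float(2 * precision * recall / (precision + recall))


gt = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 1.0], [0.0, 1.0, 2.0], [1.0, 1.0, 2.0]])
pred = np.array([[0.0, 0.0, 1.05], [1.0, 0.0, 1.5], [0.0, 1.0, 2.0]])
score = pointcloud_f_score(pred, gt)
print(round(score, 4))
```

In this toy case two of the three predicted points fall within the threshold (precision 2/3) and two of the four ground-truth points are covered (recall 1/2), so the score is 4/7. For full-resolution depth maps, a spatial index such as a k-d tree would be a more practical choice than the dense distance matrix used here.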


:construction_worker: Organizers

Ripudaman Singh Arora
Principal ML Researcher
Blue River Technology
Jaime Spencer
Data Engineer
Oxa
Fabio Tosi
Junior Assistant Professor
University of Bologna
Matteo Poggi
Tenure-Track Assistant Professor
University of Bologna
Chris Russell
Associate Professor
Oxford Internet Institute
Simon Hadfield
Associate Professor
University of Surrey
Richard Bowden
Professor
University of Surrey