CVPR 2022 WS: Embodied AI - Robotic Vision Scene Understanding Challenge



~ Registration is now closed ~

The CVPR 2022 Embodied AI Workshop edition of our Robotic Vision Scene Understanding Challenge evaluates how well a robotic vision system can understand the semantic and geometric aspects of its environment. The challenge consists of two distinct tasks: Object-based Semantic SLAM, and Scene Change Detection.

Key features of this challenge include:

Challenge completed. Evaluation servers are currently offline.


Challenge Results

At the CVPR 2022 Embodied AI workshop, we announced the winners of this year’s Robotic Vision Scene Understanding Challenge. Final team rankings and details are shown below.

1st place - 1 NVIDIA A6000 GPU, $1,000 USD and Jetson Nanos

Team Name: SP

Team Members: Sayan Paul

Team Affiliations: Tata Consultancy Services Research

Paper: TBA

2nd place - 1 NVIDIA A6000 GPU, $500 USD and Jetson Nanos

Team Name: MSCLab

Team Members: Antyanta Bangunharcana, Soohyun Kim and Kyung-Soo Kim

Team Affiliations: Korea Advanced Institute of Science and Technology

Paper: MSCLab Report

3rd place - $1,000 USD

Team Name: Team Bosch Corporate Research Semantic SLAM

Team Members: Narunas Vaskevicius, Christian Jütte, Reza Sabzevari, Timm Linder, Peter Biber and Stefan Benz

Team Affiliations: Bosch Research

Paper: Panoptic Hierarchical Semantic SLAM

4th place

Team Name: ITL-Team

5th place

Team Name: WIN (Semantic Fusion)

Challenge Task Details

The semantic scene understanding challenge is split into 3 modes, across 2 different semantic scene understanding tasks, for a total of 6 different challenge variations. The different variations open up the challenge to a wider range of participants, from those who want to focus solely on visual scene understanding to those who wish to integrate robot localisation with visual detection algorithms.

The two different semantic scene understanding tasks are:

  1. Semantic SLAM: Participants use a robot to traverse the environment, building up an object-based semantic map from the robot’s RGBD sensor observations and odometry measurements.
  2. Scene change detection (SCD): Participants use a robot to traverse an environment scene, building up a semantic understanding of the scene. The robot is then moved to a new start position in the same environment, but with different conditions. Along with a possible change from day to night, the new scene has a number of objects added and / or removed. Participants must produce an object-based semantic map describing the changes between the two scenes.

The object-based semantic maps generated by submissions are evaluated against the corresponding ground-truth object-based semantic map (for task 2 this is the map of changes in the second scene, with respect to the first). Please see the BenchBot Evaluation documentation for details on object-based semantic map submission formats, and further explanation of the two different tasks.
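To make the scene change detection task more concrete, here is a minimal Python sketch of comparing two object-based maps. The map layout (a list of dicts with `label` and `centroid`), the `detect_changes` helper, and the distance threshold are all assumptions made for illustration; they are not the BenchBot submission format, which is described in the BenchBot Evaluation documentation.

```python
import math


def detect_changes(first_map, second_map, max_dist=0.5):
    """Return (added, removed) objects between two simple object maps.

    Each map is a list of dicts with a 'label' string and a 'centroid'
    (x, y, z) tuple in metres. The layout and the matching rule are
    illustrative assumptions, not the BenchBot format.
    """
    def matches(a, b):
        # Same class label and centroids within max_dist of each other.
        return (a["label"] == b["label"]
                and math.dist(a["centroid"], b["centroid"]) <= max_dist)

    unmatched_first = list(first_map)
    added = []
    for obj in second_map:
        match = next((o for o in unmatched_first if matches(o, obj)), None)
        if match is None:
            added.append(obj)              # only present in the second scene
        else:
            unmatched_first.remove(match)  # present in both scenes
    removed = unmatched_first              # only present in the first scene
    return added, removed


# Hypothetical maps built from two traversals of the same environment.
scene_1 = [
    {"label": "chair", "centroid": (1.0, 2.0, 0.4)},
    {"label": "bottle", "centroid": (3.0, 0.5, 0.9)},
]
scene_2 = [
    {"label": "chair", "centroid": (1.1, 2.0, 0.4)},
    {"label": "laptop", "centroid": (3.0, 0.6, 0.9)},
]

added, removed = detect_changes(scene_1, scene_2)
print("added:", [o["label"] for o in added])
print("removed:", [o["label"] for o in removed])
```

A real solution would also need to handle uncertain labels and imperfect localisation, which is why the matching tolerance is left as a parameter here.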

Each task has 3 variations, corresponding to the following modes:

  1. Passive control, with ground-truth localisation (PGT): The robot follows a fixed trajectory, and participants are given a single method to control the robot: moving to the next pose. The task cannot be continued once the entire trajectory has been traversed. Participants receive ground-truth poses for all robot components after each action.
  2. Active control, with ground-truth localisation (AGT): The robot can be controlled by either moving forward a requested distance, or rotating on the spot a requested number of degrees. The task cannot be continued if the robot collides with the environment. Participants receive ground-truth poses for all robot components after each action.
  3. Active control, using dead reckoning (i.e. no localisation) (ADR): The robot can be controlled by either moving forward a requested distance, or rotating on the spot a requested number of degrees. Participants receive poses derived from robot odometry after each action, with localisation error that accumulates over time.

Please see the BenchBot API documentation for full details about what actions and observations are available in each mode, and how to use them with your semantic scene understanding algorithms.
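The difference between ground-truth localisation and dead reckoning can be illustrated with a small Python sketch. It integrates the two active-control actions (move forward a distance, rotate a number of degrees) into a 2D pose, once exactly and once with a small systematic odometry error. The `integrate` function, the action tuples, and the error parameters are hypothetical and are not part of the BenchBot API.

```python
import math


def integrate(actions, distance_scale=1.0, angle_bias_deg=0.0):
    """Integrate (kind, value) actions into a 2D pose (x, y, heading_deg).

    distance_scale and angle_bias_deg model a systematic odometry error;
    both are made-up parameters for illustration.
    """
    x = y = heading = 0.0
    for kind, value in actions:
        if kind == "forward":
            d = value * distance_scale
            x += d * math.cos(math.radians(heading))
            y += d * math.sin(math.radians(heading))
        elif kind == "rotate":
            heading += value + angle_bias_deg
        else:
            raise ValueError(f"unknown action: {kind}")
    return x, y, heading


def position_error(laps):
    """Distance between the true and dead-reckoned position after `laps` squares."""
    # One lap: drive a 1 m square, returning to the start pose.
    actions = [("forward", 1.0), ("rotate", 90.0)] * 4 * laps
    true_x, true_y, _ = integrate(actions)
    est_x, est_y, _ = integrate(actions, distance_scale=1.02, angle_bias_deg=1.0)
    return math.hypot(est_x - true_x, est_y - true_y)


for laps in (1, 2, 3):
    print(f"laps={laps} position error={position_error(laps):.3f} m")
```

In this toy model the error after three laps is larger than after one, which mirrors the accumulating localisation error described for the ADR mode.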

How to Participate

The challenge participation page is available on the EvalAI website, and will be open until May 31st. Please create an account, sign in, & click the “Participate” tab to enter our challenge. Full details on how to participate, the available software framework, and submission requirements are provided on the site.

Participating in the Semantic Scene Understanding Challenge is as simple as the steps below. The BenchBot software stack is designed from the ground up to eliminate as many obstacles as possible, so you can focus on what matters: solving semantic scene understanding problems. A collection of resources, documentation, and examples is also available within the BenchBot ecosystem to support your experience while participating in the challenge.

To participate in our challenge:

  1. Download & install the BenchBot software stack. Use the examples to dive straight in & start playing (benchbot_run --list-examples)
  2. Choose a task to start working on a solution for, using benchbot_run --list-tasks to list supported tasks
  3. Start with the development environments “miniroom” and “house” which include ground-truth maps to aid in your algorithm development
  4. Develop a solution using the benchbot_run, benchbot_submit, & benchbot_eval scripts
  5. Create some results for your solution in the challenge environments using: benchbot_batch -r carter -t <your_task> -E <batch_name> -z -n <your_submission_cmd>
  6. Use the Submit tab at the top of our EvalAI page to submit your results for evaluation
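If you script your experiments, the batch command from step 5 can be assembled programmatically. The sketch below only builds the argument list using the flags shown above; the `build_batch_command` helper and the example task, batch, and solution names are hypothetical placeholders, so substitute values reported by your own BenchBot installation.

```python
import shlex


def build_batch_command(task, batch_name, submission_cmd, robot="carter"):
    """Assemble the benchbot_batch invocation from step 5 as an argv list.

    Only the flags that appear in step 5 are used; the submission command
    is split into tokens the same way a shell would split it.
    """
    return [
        "benchbot_batch",
        "-r", robot,
        "-t", task,
        "-E", batch_name,
        "-z",
        "-n", *shlex.split(submission_cmd),
    ]


# Placeholder values: replace with your task, batch name, and solution command.
cmd = build_batch_command("my_task", "my_batch", "python my_solution.py")
print(shlex.join(cmd))
```

With BenchBot installed, the resulting list could be passed to `subprocess.run(cmd, check=True)` to launch the batch.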

Participants need to provide a short summary of their method at submission time. Each participant is also encouraged to write a short paper (approx 4 pages) outlining the technique which was used in the challenge. These aren’t required at the same time as the initial submission but will be made available to the general public through our challenge website. Please submit summaries and papers by e-mail to

Challenge Prizes

The challenge has a prize pool including:

The best-performing teams will be eligible for prizes, with multiple teams being awarded prizes. We encourage teams to submit their best results regardless of whether they think they are good enough; you never know, it might end up winning you a GPU!

Important Dates

Workshop Details

Our Robotic Vision Scene Understanding Challenge is currently part of the CVPR 2022 Embodied AI workshop. Please check out the workshop website for more details and for information about 9 other robotic vision/embodied AI challenges being run.


Talk to us on Slack or contact us via email at

Organisers, Support, and Acknowledgements

Stay in touch and follow us on Twitter for news and announcements: @robVisChallenge.

David Hall
Queensland University of Technology
Ben Talbot
Queensland University of Technology
Niko Sünderhauf
Queensland University of Technology

The Robotic Vision Challenges organisers are with the QUT Centre for Robotics (QCR) in Brisbane, Australia.

We thank the following supporters of this work: