Making a robot understand what it sees is one of the most fascinating goals in my current research. To this end, we develop novel methods for Semantic Mapping and Semantic SLAM by combining object detection with simultaneous localisation and mapping (SLAM) techniques. We furthermore work on Bayesian Deep Learning for object detection, to better understand the uncertainty of a deep network’s predictions and integrate deep learning into robotics in a probabilistic way.

Semantic SLAM and Semantic Mapping

  1. QuadricSLAM: Constrained Dual Quadrics from Object Detections as Landmarks in Object-oriented SLAM Lachlan Nicholson, Michael Milford, Niko Sünderhauf. IEEE Robotics and Automation Letters (RA-L), 2018. In this paper, we use 2D object detections from multiple views to simultaneously estimate a 3D quadric surface for each object and localize the camera position. We derive a SLAM formulation that uses dual quadrics as 3D landmark representations, exploiting their ability to compactly represent the size, position and orientation of an object, and show how 2D object detections can directly constrain the quadric parameters via a novel geometric error formulation. We develop a sensor model for object detectors that addresses the challenge of partially visible objects, and demonstrate how to jointly estimate the camera pose and constrained dual quadric parameters in factor graph based SLAM with a general perspective camera (a minimal projection sketch follows this list). [arXiv] [website]
  1. An Orientation Factor for Object-Oriented SLAM Natalie Jablonsky, Michael Milford, Niko Sünderhauf. arXiv preprint, 2018. Current approaches to object-oriented SLAM lack the ability to incorporate prior knowledge of the scene geometry, such as the expected global orientation of objects. We overcome this limitation by proposing a geometric factor that constrains the global orientation of objects in the map, depending on the objects’ semantics. This new geometric factor is a first example of how semantics can inform and improve geometry in object-oriented SLAM. We implement the geometric factor for the recently proposed QuadricSLAM that represents landmarks as dual quadrics. The factor probabilistically models the quadrics’ major axes to be either perpendicular to or aligned with the direction of gravity, depending on their semantic class. Our experiments on simulated and real-world datasets show that using the proposed factors to incorporate prior knowledge improves both the trajectory and landmark quality. [arXiv] [website]
  1. QuadricSLAM: Constrained Dual Quadrics from Object Detections as Landmarks in Semantic SLAM Lachlan Nicholson, Michael Milford, Niko Sünderhauf. In Workshop on Representing a Complex World, International Conference on Robotics and Automation (ICRA), 2018. Best Workshop Paper Award We derive a SLAM formulation that uses dual quadrics as 3D landmark representations, exploiting their ability to compactly represent the size, position and orientation of an object, and show how 2D bounding boxes (such as those typically obtained from visual object detection systems) can directly constrain the quadric parameters via a novel geometric error formulation. We develop a sensor model for deep-learned object detectors that addresses the challenge of partial object detections often encountered in robotics applications, and demonstrate how to jointly estimate the camera pose and constrained dual quadric parameters in factor graph based SLAM with a general perspective camera. [arXiv] [website]
  1. Meaningful Maps With Object-Oriented Semantic Mapping Niko Sünderhauf, Trung T. Pham, Yasir Latif, Michael Milford. In Proc. of IEEE International Conference on Intelligent Robots and Systems (IROS), 2017. For intelligent robots to interact in meaningful ways with their environment, they must understand both the geometric and semantic properties of the scene surrounding them. The majority of research to date has addressed these mapping challenges separately, focusing on either geometric or semantic mapping. In this paper we address the problem of building environmental maps that include both semantically meaningful, object-level entities and point- or mesh-based geometrical representations. We simultaneously build geometric point cloud models of previously unseen instances of known object classes and create a map that contains these object models as central entities. Our system leverages sparse, feature-based RGB-D SLAM, image-based deep-learning object detection and 3D unsupervised segmentation.
  1. Place Categorization and Semantic Mapping on a Mobile Robot Niko Sünderhauf, Feras Dayoub, Sean McMahon, Ben Talbot, Ruth Schulz, Peter Corke, Gordon Wyeth, Ben Upcroft, Michael Milford. Proc. of IEEE International Conference on Robotics and Automation (ICRA), 2016. In this paper we focus on the challenging problem of place categorization and semantic mapping on a robot without environment-specific training. Motivated by their ongoing success in various visual recognition tasks, we build our system upon a state-of-the-art convolutional network. We overcome its closed-set limitations by complementing the network with a series of one-vs-all classifiers that can learn to recognize new semantic classes online. Prior domain knowledge is incorporated by embedding the classification system into a Bayesian filter framework that also ensures temporal coherence. We evaluate the classification accuracy of the system on a robot that maps a variety of places on our campus in real-time. We show how semantic information can boost robotic object detection performance and how the semantic map can be used to modulate the robot’s behaviour during navigation tasks. The system is made available to the community as a ROS module. (A toy sketch of the Bayesian filtering step follows this list.)
  1. SLAM – Quo Vadis? In Support of Object Oriented and Semantic SLAM Niko Sünderhauf, Feras Dayoub, Sean McMahon, Markus Eich, Ben Upcroft, Michael Milford. In Workshop on The Problem of Moving Sensors, Robotics: Science and Systems (RSS), 2015. Most current SLAM systems are still based on primitive geometric features such as points, lines, or planes. The created maps therefore carry geometric information, but no immediate semantic information. With the recent significant advances in object detection and scene classification, we think the time is right for the SLAM community to ask where SLAM research should be heading in the coming years. As a possible answer to this question, we advocate developing SLAM systems that are more object oriented and more semantically enriched than the current state of the art. This paper provides an overview of our ongoing work in this direction.
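
To make the quadric-as-landmark idea in QuadricSLAM concrete, the sketch below shows, under the standard dual-quadric projection model, how a 3D ellipsoid projects to a dual conic and how the conic's axis-aligned bounding box can be compared against a detected bounding box. The camera intrinsics, object parameters and detected box are made up for illustration, and the simple coordinate difference stands in for the paper's more careful geometric error, which also handles partially visible objects.

```python
import numpy as np

def ellipsoid_dual_quadric(center, radii, R=np.eye(3)):
    """Dual quadric Q* of an ellipsoid with the given center, semi-axis
    lengths and rotation (an illustrative, not the paper's, parameterisation)."""
    Q_local = np.diag([radii[0]**2, radii[1]**2, radii[2]**2, -1.0])
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = center
    return T @ Q_local @ T.T          # move the dual quadric into the world frame

def project_to_bbox(Q_star, P):
    """Project a dual quadric with camera matrix P (3x4) and return the
    axis-aligned bounding box (umin, vmin, umax, vmax) of the dual conic."""
    C = P @ Q_star @ P.T              # dual conic, 3x3 symmetric
    C = C / C[2, 2]                   # normalise so the tangent-line formulas simplify
    du = np.sqrt(C[0, 2]**2 - C[0, 0])   # tangent vertical lines:   u = C02 +/- du
    dv = np.sqrt(C[1, 2]**2 - C[1, 1])   # tangent horizontal lines: v = C12 +/- dv
    return np.array([C[0, 2] - du, C[1, 2] - dv, C[0, 2] + du, C[1, 2] + dv])

# Entirely made-up camera and object, purely for illustration.
K = np.array([[525.0, 0.0, 320.0], [0.0, 525.0, 240.0], [0.0, 0.0, 1.0]])
P = K @ np.hstack([np.eye(3), np.zeros((3, 1))])       # camera at the world origin
Q_star = ellipsoid_dual_quadric(center=[0.3, 0.1, 4.0], radii=[0.4, 0.3, 0.5])

predicted_box = project_to_bbox(Q_star, P)
detected_box = np.array([280.0, 210.0, 420.0, 300.0])  # fictitious detector output
residual = predicted_box - detected_box                # simplified geometric error
print(predicted_box, residual)
```

In a factor graph, a residual of this form would tie each camera pose variable to each observed quadric variable, alongside the usual odometry factors.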

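The temporal-coherence part of the place categorization system above boils down to a discrete Bayes filter over place categories. The toy sketch below assumes a simple "sticky" transition model and treats the per-frame network scores directly as measurement likelihoods; both are illustrative simplifications rather than the paper's exact model.

```python
import numpy as np

def bayes_filter_update(belief, likelihood, stay_prob=0.9):
    """One step of a discrete Bayes filter over K place categories.

    belief     -- current posterior over the categories (sums to 1)
    likelihood -- per-category score from the CNN for the current frame,
                  treated here as a measurement likelihood (an assumption)
    stay_prob  -- probability of staying in the same category between frames;
                  the remainder is spread uniformly (illustrative choice)
    """
    K = len(belief)
    transition = np.full((K, K), (1.0 - stay_prob) / (K - 1))
    np.fill_diagonal(transition, stay_prob)
    predicted = transition.T @ belief          # prediction step (temporal smoothing)
    posterior = predicted * likelihood         # measurement update
    return posterior / posterior.sum()

# Toy example with three place categories: office, corridor, kitchen.
belief = np.array([1 / 3, 1 / 3, 1 / 3])
for frame_scores in [np.array([0.7, 0.2, 0.1]),
                     np.array([0.4, 0.5, 0.1]),   # a noisy frame
                     np.array([0.8, 0.1, 0.1])]:
    belief = bayes_filter_update(belief, frame_scores)
print(belief)   # stays confident in "office" despite the noisy middle frame
```
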
Scene Understanding

  1. SceneCut: Joint Geometric and Object Segmentation for Indoor Scenes Trung T. Pham, Thanh-Toan Do, Niko Sünderhauf, Ian Reid, Michael Milford. In Proc. of IEEE International Conference on Robotics and Automation (ICRA), 2018. This paper presents SceneCut, a novel approach to jointly discover previously unseen objects and non-object surfaces using a single RGB-D image. SceneCut’s joint reasoning over scene semantics and geometry allows a robot to detect and segment object instances in complex scenes where modern deep learning-based methods either fail to separate object instances, or fail to detect objects that were not seen during training. SceneCut automatically decomposes a scene into meaningful regions which either represent objects or scene surfaces. The decomposition is qualified by a unified energy function over objectness and geometric fitting. We show how this energy function can be optimized efficiently by utilizing hierarchical segmentation trees (a toy tree-cut sketch follows this list).
  1. Enhancing Human Action Recognition with Region Proposals Fahimeh Rezazadegan, Sareh Shirazi, Niko Sünderhauf, Michael Milford, Ben Upcroft. In Proceedings of the Australasian Conference on Robotics and Automation (ACRA), 2015.
  1. Continuous Factor Graphs For Holistic Scene Understanding Niko Sünderhauf, Ben Upcroft, Michael Milford. In Workshop on Scene Understanding (SUNw), Intl. Conf. on Computer Vision and Pattern Recognition (CVPR), 2015. We propose a novel mathematical formulation for the holistic scene understanding problem and transform it from the discrete into the continuous domain. The problem can then be modeled with a nonlinear continuous factor graph, and the MAP solution is found via least squares optimization. We evaluate our method on the realistic NYU2 dataset. (A toy least-squares reduction of such a factor graph follows this list.)
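
The reason a tree-structured decomposition keeps SceneCut tractable is that an energy defined over cuts of a segmentation tree can be optimized exactly with a bottom-up dynamic program. The sketch below shows that recursion on a toy tree with a single per-node score; the actual SceneCut energy combines objectness and geometric-fitting terms and is considerably richer.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """A node of a hierarchical segmentation tree (illustrative structure)."""
    score: float                 # stand-in for combined objectness + geometric-fit score
    children: List["Node"] = field(default_factory=list)

def best_cut(node):
    """Return (value, regions): the best-scoring cut of the subtree at `node`.
    Either keep the node as a single region, or recursively cut its children."""
    if not node.children:
        return node.score, [node]
    child_value, child_regions = 0.0, []
    for child in node.children:
        value, regions = best_cut(child)
        child_value += value
        child_regions += regions
    if node.score >= child_value:
        return node.score, [node]
    return child_value, child_regions

# Toy tree: the root is a poor region, but two of its children score well as objects.
tree = Node(score=0.3, children=[Node(score=0.8),
                                 Node(score=0.6, children=[Node(0.2), Node(0.3)])])
value, regions = best_cut(tree)
print(value, [r.score for r in regions])   # -> 1.4, [0.8, 0.6]
```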

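For the continuous factor graph formulation above, MAP inference reduces to nonlinear least squares over the stacked factor residuals. The toy graph below (two scalar variables, one prior factor, one measurement factor and one pairwise factor, all invented for illustration) shows only that reduction, not the paper's scene-understanding model.

```python
import numpy as np
from scipy.optimize import least_squares

def residuals(x):
    """Stacked residuals of a toy factor graph over two scalar variables x0, x1.
    Each residual corresponds to one factor; summing their squares gives the
    negative log posterior under Gaussian factor noise."""
    x0, x1 = x
    return np.array([
        x0 - 1.0,          # unary (prior) factor:       x0 should be near 1
        x1 - 3.0,          # unary (measurement) factor: x1 should be near 3
        (x1 - x0) - 1.5,   # pairwise factor:            x1 - x0 should be near 1.5
    ])

solution = least_squares(residuals, np.zeros(2))   # nonlinear least-squares solve
print(solution.x)   # MAP estimate, a compromise between the three factors
```
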
Scene Understanding: Hazard Detection

  1. Multi-Modal Trip Hazard Affordance Detection On Construction Sites Sean McMahon, Niko Sünderhauf, Ben Upcroft, Michael J Milford. IEEE Robotics and Automation Letters (RA-L), 2017. Trip hazards are a significant contributor to accidents on construction and manufacturing sites. We conduct a comprehensive investigation into the performance characteristics of 11 different color and depth fusion approaches, including four fusion and one non-fusion approach, using color and two types of depth images. Trained and tested on more than 600 labeled trip hazards over four floors and 2000 m² of an active construction site, this approach was able to differentiate between identical objects in different physical configurations. Outperforming a color-only detector, our multi-modal trip detector fuses color and depth information to achieve a 4% absolute improvement in F1-score. These investigative results and the extensive publicly available dataset move us one step closer to assistive or fully automated safety inspection systems on construction sites.
  1. Auxiliary Tasks to Improve Trip Hazard Affordance Detection Sean McMahon, Tong Shen, Niko Sünderhauf, Ian Reid, Chunhua Shen, Michael Milford. In Proceedings of the Australasian Conference on Robotics and Automation (ACRA), 2017. We propose to train a CNN performing pixel-wise trip detection with three auxiliary tasks that help the CNN better infer the geometric properties of trip hazards. Of the three auxiliary tasks investigated (pixel-wise ground plane estimation, pixel depth estimation, and pixel height above the ground plane), the first allowed the trip detector to achieve an 11.1% increase in Trip IOU over earlier work. These new approaches make it plausible to deploy a robotic platform to perform trip hazard detection, and so potentially reduce the number of injuries on construction sites. (A toy multi-task training sketch follows this list.)
  1. TripNet: Detecting Trip Hazards on Construction Sites Sean McMahon, Niko Sünderhauf, Michael Milford, Ben Upcroft. In Proceedings of the Australasian Conference on Robotics and Automation (ACRA), 2015. This paper introduces TripNet, a robotic vision system that detects trip hazards using raw construction site images. TripNet performs trip hazard identification using only camera imagery and minimal training with a pre-trained Convolutional Neural Network (CNN) rapidly fine-tuned on a small corpus of labelled image regions from construction sites. There is no reliance on prior scene segmentation methods during deployment. TripNet achieves comparable performance to a human on a dataset recorded in two distinct real world construction sites. TripNet exhibits spatial and temporal generalization by functioning in previously unseen parts of a construction site and over time periods of several weeks.
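
The auxiliary-task idea above amounts to training a shared encoder on a weighted sum of the main pixel-wise trip loss and auxiliary geometric losses. The PyTorch sketch below uses a deliberately tiny encoder and a single auxiliary depth head as stand-ins; the architectures, auxiliary tasks and loss weights in the papers differ.

```python
import torch
import torch.nn as nn

class MultiTaskTripNet(nn.Module):
    """A tiny stand-in for a pixel-wise multi-task CNN: one shared encoder,
    a trip-hazard segmentation head, and an auxiliary depth-regression head
    (layer sizes and heads are illustrative only)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
        )
        self.trip_head = nn.Conv2d(16, 2, 1)    # per-pixel trip / no-trip logits
        self.depth_head = nn.Conv2d(16, 1, 1)   # auxiliary per-pixel depth estimate

    def forward(self, x):
        features = self.encoder(x)
        return self.trip_head(features), self.depth_head(features)

model = MultiTaskTripNet()
images = torch.randn(2, 3, 64, 64)                 # fake RGB crops
trip_labels = torch.randint(0, 2, (2, 64, 64))     # fake per-pixel trip labels
depth_labels = torch.rand(2, 1, 64, 64)            # fake per-pixel depth targets

trip_logits, depth_pred = model(images)
loss = nn.CrossEntropyLoss()(trip_logits, trip_labels) \
     + 0.5 * nn.MSELoss()(depth_pred, depth_labels)   # weighted auxiliary loss
loss.backward()   # the auxiliary gradients also shape the shared encoder
```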