This project was carried out as part of the 3D Vision course (2026) at ETH Zürich with the Computer Vision and Geometry Group, as our entry to the Hilti × Trimble SLAM Challenge 2026.

Task. Produce a metric, gravity-aligned, floorplan-consistent trajectory and 3D map on the Hilti × Trimble SLAM Challenge 2026 indoor dual-fisheye + IMU dataset.

Problem. OpenVINS drifts on long indoor traverses with repetitive textures.

Solution. We run two pipelines on top of the VIO trajectory: a lightweight COLMAP bundle-adjustment refinement, and a full panorama SfM built from scratch with tightly-coupled visual-inertial bundle adjustment. A shared 2D ICP post-processing step then aligns both outputs to the building floorplan.

  • Team: Gilbert Tanner, Giorgos Evangelou, Julian Lechner
  • Supervisors: Zador Pataki, Xudong Jiang, Paul-Edouard Sarlin, Shaohui Liu
  • Affiliations: ETH Zürich, AAU Klagenfurt & DLR, Google


Challenge Ranking

Our submission was evaluated on both tracks of the Hilti × Trimble SLAM Challenge 2026:

  • Localization Challenge: ranked 3 / 22
  • SLAM Challenge: ranked 7 / 62

Pipeline Overview

The reconstruction runs in four stages: data preparation, feature extraction and matching, mapping & refinement, and floorplan alignment.

End-to-end pipeline: data prep, features, mapping & refinement, floorplan alignment

Method

Virtual perspective views

Each fisheye pair is remapped into a 4×3 rig of 90° perspective cameras with a 1/θ blend and Voronoi seams, turning the 360° capture into a set of pinhole views that standard SfM tooling can consume.

Virtual perspective views rendered from the dual-fisheye rig
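
As a rough illustration of the remapping, the sketch below computes where one pixel of a virtual 90° pinhole view lands in a fisheye image. It is not the project's code. The equidistant lens model (r = f·θ), the 4 yaw × 3 pitch layout, the pitch levels, the axis and sign conventions, and all function names are assumptions made for the example; the 1/θ weight is one possible reading of the blend mentioned above.

```python
import math

def rig_orientations():
    """Assumed 4x3 layout: four 90-degree yaw steps at three pitch levels."""
    yaws = [0.0, 90.0, 180.0, 270.0]
    pitches = [-45.0, 0.0, 45.0]  # hypothetical pitch levels
    return [(yaw, pitch) for pitch in pitches for yaw in yaws]

def pinhole_ray(u, v, size, fov_deg=90.0):
    """Unit ray through pixel (u, v) of a square pinhole view (x right, y down, z forward)."""
    f = 0.5 * size / math.tan(math.radians(fov_deg) / 2.0)
    c = 0.5 * size  # principal point at the image centre
    x, y, z = (u - c) / f, (v - c) / f, 1.0
    n = math.sqrt(x * x + y * y + z * z)
    return (x / n, y / n, z / n)

def rotate_yaw_pitch(ray, yaw_deg, pitch_deg):
    """Rotate a ray from the virtual-camera frame into the fisheye frame."""
    x, y, z = ray
    p = math.radians(pitch_deg)
    # Pitch: rotation about the x axis.
    y, z = y * math.cos(p) - z * math.sin(p), y * math.sin(p) + z * math.cos(p)
    w = math.radians(yaw_deg)
    # Yaw: rotation about the y axis.
    x, z = x * math.cos(w) + z * math.sin(w), -x * math.sin(w) + z * math.cos(w)
    return (x, y, z)

def project_equidistant(ray, f, cx, cy):
    """Equidistant fisheye projection r = f * theta; returns (u, v, theta)."""
    x, y, z = ray
    theta = math.atan2(math.hypot(x, y), z)  # angle from the optical axis
    phi = math.atan2(y, x)
    r = f * theta
    return (cx + r * math.cos(phi), cy + r * math.sin(phi), theta)

def blend_weight(theta, eps=1e-3):
    """Assumed 1/theta weight: rays near a lens's optical axis dominate the blend."""
    return 1.0 / max(theta, eps)
```

Looping over every pixel of each of the twelve orientations yields a lookup table for resampling. Rays whose θ exceeds a lens's half field of view would be sampled from the other fisheye instead.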

Carrier masking

YOLOE segments the operator, helmet and tablet; the dilated mask (yellow) is unioned with a static carrier template (green) so the moving carrier is excluded from reconstruction.

Carrier masking combining a YOLOE segmentation with a static carrier template
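
The mask combination itself is simple; a minimal sketch on nested boolean lists follows. The function names and the square structuring element are assumptions for illustration, and an image library would normally perform the dilation on real frames.

```python
def dilate(mask, radius=1):
    """Binary dilation with a (2*radius+1) x (2*radius+1) square element."""
    h, w = len(mask), len(mask[0])
    out = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                # Switch on every pixel in the square window around (x, y).
                for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                    for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                        out[yy][xx] = True
    return out

def carrier_mask(segmentation, template, radius=1):
    """Union of the dilated per-frame segmentation and the static template."""
    dilated = dilate(segmentation, radius)
    return [[d or t for d, t in zip(d_row, t_row)]
            for d_row, t_row in zip(dilated, template)]
```

Pixels set to True in the result would be excluded from feature extraction.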

LoMa feature matching

LoMa delivers dense correspondences on the dark, repetitive indoor frames that classical matchers struggle with.

Dense LoMa feature matches on repetitive indoor frames

Loop closure via MegaLoc

MegaLoc descriptors averaged over the twelve virtual cameras yield heading-invariant frame descriptors, catching start/end revisits that close the trajectory.

Loop closure detection using MegaLoc retrieval
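
The averaging step can be sketched as below, assuming each virtual camera already has a global descriptor vector. Because a mean ignores the order of the cameras, a change of heading that maps the rig onto itself leaves the frame descriptor unchanged; for other headings the invariance is only approximate. The `min_gap` and `threshold` values and all function names are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)

def frame_descriptor(camera_descriptors):
    """Average the per-camera descriptors of one frame, then renormalize."""
    count = len(camera_descriptors)
    dim = len(camera_descriptors[0])
    mean = [sum(d[i] for d in camera_descriptors) / count for i in range(dim)]
    return l2_normalize(mean)

def loop_candidates(frame_descs, min_gap=50, threshold=0.8):
    """Frame pairs far apart in time whose cosine similarity is high."""
    pairs = []
    for i in range(len(frame_descs)):
        for j in range(i + min_gap, len(frame_descs)):
            sim = sum(a * b for a, b in zip(frame_descs[i], frame_descs[j]))
            if sim >= threshold:
                pairs.append((i, j, sim))
    return pairs
```

The temporal gap keeps trivially similar neighbouring frames out of the candidate list, so only genuine revisits such as a start/end overlap remain.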

Mapping & visual-inertial refinement

On top of the prepared rig observations we run two complementary mapping routes, both starting from the metric, gravity-aligned VIO baseline. Which one we submit is a trade-off between runtime and accuracy.

Lightweight bundle-adjustment refinement

Because re-running global SfM from scratch is expensive, we provide a lighter alternative that only refines the baseline. The dual-fisheye images are remapped onto the same twelve virtual perspective views, masked, and run through feature extraction and matching. Every rig pose is initialized from the baseline trajectory; 3D points are then triangulated from the known poses and matches, and the whole reconstruction is jointly refined by bundle adjustment with intrinsics, extrinsics and the metric scale frozen. The result is a drift-corrected reconstruction that keeps the gravity-aligned, metric character of the baseline at a small fraction of the cost of the full mapping pipeline.
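
To make "triangulated from the known poses and matches" concrete, here is a minimal two-view midpoint triangulation. It assumes the camera centres come from the baseline poses and the ray directions from a matched pixel pair, both expressed in the world frame. The function name is hypothetical, and production SfM tools use more robust multi-view estimators than this sketch.

```python
def triangulate_midpoint(o1, d1, o2, d2):
    """Midpoint of the shortest segment between rays o1 + s*d1 and o2 + t*d2.

    Returns None when the rays are (nearly) parallel.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    w = tuple(p - q for p, q in zip(o1, o2))
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        return None  # no stable intersection for parallel rays
    s = (b * e - c * d) / denom  # parameter of the closest point on ray 1
    t = (a * e - b * d) / denom  # parameter of the closest point on ray 2
    p1 = tuple(o + s * k for o, k in zip(o1, d1))
    p2 = tuple(o + t * k for o, k in zip(o2, d2))
    return tuple(0.5 * (u + v) for u, v in zip(p1, p2))
```

The resulting points only serve as initial values; the subsequent bundle adjustment refines them jointly with the poses.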

Full panorama SfM mapping

The full pipeline supports both incremental and global mapping back-ends:

  • Incremental. COLMAP's classical image-by-image registration with fixed intrinsics and rig extrinsics, accelerated with GPU bundle adjustment.
  • Global. A GLOMAP-style pipeline of rotation averaging, global positioning and joint bundle adjustment, where rotation averaging consumes a per-image gravity prior recovered from the IMU to remove the gravity-direction ambiguity from the rotation graph.
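
The geometric effect of a gravity prior can be illustrated as follows: once the gravity direction is known in the camera frame, the rotation that maps it onto world "down" fixes two of the three rotational degrees of freedom, leaving only the yaw about the vertical free. The sketch builds such an aligning rotation with Rodrigues' formula. It shows the geometry only and makes no claim about how the global mapper uses the prior internally; the function names and the choice of (0, 0, -1) as world down are assumptions.

```python
import math

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(R, v):
    return tuple(dot(row, v) for row in R)

def rotation_aligning(a, b):
    """3x3 rotation R (list of rows) with R a = b, for unit vectors a and b."""
    v = cross(a, b)
    c = dot(a, b)
    if c < -1.0 + 1e-12:
        # Antiparallel case: half-turn about any unit axis u perpendicular to a.
        helper = (1.0, 0.0, 0.0) if abs(a[0]) < 0.9 else (0.0, 1.0, 0.0)
        u = cross(a, helper)
        n = math.sqrt(dot(u, u))
        u = tuple(x / n for x in u)
        return [[2.0 * u[i] * u[j] - (1.0 if i == j else 0.0) for j in range(3)]
                for i in range(3)]
    k = 1.0 / (1.0 + c)
    vv = dot(v, v)
    vx = [[0.0, -v[2], v[1]], [v[2], 0.0, -v[0]], [-v[1], v[0], 0.0]]
    # Rodrigues: R = I + [v]x + [v]x^2 / (1 + c), with [v]x^2 = v v^T - |v|^2 I.
    return [[(1.0 if i == j else 0.0) + vx[i][j]
             + k * (v[i] * v[j] - (vv if i == j else 0.0))
             for j in range(3)] for i in range(3)]
```

Any further rotation about the vertical axis keeps the gravity direction in place, which is why only the heading remains to be estimated from the image-based rotation graph.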

Because our virtual cameras form a rig whose relative orientations are known exactly by construction, frozen extrinsics are essential. The upstream global mapper only partially honoured this: some BA passes silently re-optimised the per-camera sensor-from-rig transforms even with intrinsics fixed. We therefore extended its rig support to enforce frozen extrinsics across every stage of the global pipeline.

Tightly-coupled visual-inertial bundle adjustment

Inertial information enters at two levels. First, the global mapper consumes the IMU-derived per-image gravity prior during rotation averaging. Second, we run a tightly-coupled visual-inertial refinement:

  • Closed-form initialization following Mur-Artal and Tardós first estimates the gyroscope bias from rotation pairs, then solves a least-squares system over consecutive keyframes for the metric scale, gravity direction and per-frame velocities, and finally refines gravity and adds the accelerometer bias under the constraint ‖g‖ = 9.81 m/s².
  • Ceres-based VI bundle adjustment warm-started from that state jointly optimizes camera poses, 3D points, per-keyframe velocities and IMU biases under reprojection residuals and IMU preintegration factors between consecutive keyframes, with bias random-walk priors regularising the biases across the trajectory.

This stage recovers metric scale, brings the reconstruction into a gravity-aligned frame, and significantly tightens the trajectory against the inertial measurements.
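
For reference, the IMU preintegration factors between consecutive keyframes i and j are commonly written in the standard on-manifold form below. The notation is an assumption for this page rather than taken from the report: R, p and v are the keyframe orientation, position and velocity in the world frame, g is the gravity vector, Δt_ij is the time between the keyframes, and tildes mark the preintegrated measurements. First-order bias corrections are omitted for brevity.

```latex
\begin{aligned}
\mathbf{r}_{\Delta R_{ij}} &= \operatorname{Log}\!\left(\Delta\tilde{R}_{ij}^{\top}\, R_i^{\top} R_j\right), \\
\mathbf{r}_{\Delta v_{ij}} &= R_i^{\top}\!\left(\mathbf{v}_j - \mathbf{v}_i - \mathbf{g}\,\Delta t_{ij}\right) - \Delta\tilde{\mathbf{v}}_{ij}, \\
\mathbf{r}_{\Delta p_{ij}} &= R_i^{\top}\!\left(\mathbf{p}_j - \mathbf{p}_i - \mathbf{v}_i\,\Delta t_{ij} - \tfrac{1}{2}\,\mathbf{g}\,\Delta t_{ij}^{2}\right) - \Delta\tilde{\mathbf{p}}_{ij}, \\
\lVert \mathbf{g} \rVert &= 9.81\ \mathrm{m/s^2}.
\end{aligned}
```

In a joint cost these residuals sit next to the reprojection residuals and the bias random-walk terms. The visual terms alone leave the scale unobservable, whereas the velocity and position residuals tie the trajectory to metric accelerations, which is how the scale and the gravity direction become observable.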

Floorplan alignment

Mask2Former on the equirectangular pair isolates wall pixels (top); the wall vote is back-projected to the SfM cloud and 2D ICP aligns it to the floorplan wall mask (bottom), closing the residual yaw and translation gap.

Wall segmentation on the equirectangular image pair
2D ICP alignment of the wall point cloud to the building floorplan
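
A minimal point-to-point 2D ICP that estimates only yaw and translation is sketched below. It assumes the wall-voted SfM points have already been projected onto the ground plane (`src`) and that the floorplan wall mask has been sampled into 2D points (`dst`). The function names are hypothetical, matching is brute force, and there is no outlier rejection, so it illustrates the idea rather than the project's implementation.

```python
import math

def nearest(p, targets):
    """Closest target point to p (brute force)."""
    return min(targets, key=lambda q: (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)

def fit_rigid_2d(src, dst):
    """Least-squares yaw and translation mapping paired src points onto dst."""
    n = len(src)
    sx, sy = sum(p[0] for p in src) / n, sum(p[1] for p in src) / n
    dx, dy = sum(q[0] for q in dst) / n, sum(q[1] for q in dst) / n
    cos_sum = sin_sum = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        ax, ay, bx, by = px - sx, py - sy, qx - dx, qy - dy
        cos_sum += ax * bx + ay * by  # dot products of centred pairs
        sin_sum += ax * by - ay * bx  # 2D cross products of centred pairs
    yaw = math.atan2(sin_sum, cos_sum)
    c, s = math.cos(yaw), math.sin(yaw)
    return yaw, dx - (c * sx - s * sy), dy - (s * sx + c * sy)

def transform(points, yaw, tx, ty):
    c, s = math.cos(yaw), math.sin(yaw)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]

def icp_2d(src, dst, iters=20):
    """Alternate nearest-neighbour matching and rigid refitting."""
    yaw, tx, ty = 0.0, 0.0, 0.0
    for _ in range(iters):
        moved = transform(src, yaw, tx, ty)
        matches = [nearest(p, dst) for p in moved]
        # Refit from the original points so the estimate is the total transform.
        yaw, tx, ty = fit_rigid_2d(src, matches)
    return yaw, tx, ty
```

Like any ICP, this converges only from a reasonable initial guess, which fits its role here of closing a residual yaw and translation gap rather than finding a global alignment.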

Final aligned reconstruction

The full SfM point cloud (gray) fills the building outline, with the recovered trajectory (red) tracing the operator's path through the site.

Final floorplan-aligned SfM reconstruction with recovered trajectory

The clip below compares our recovered trajectory (estimate) against the ground-truth reference trajectory for sequence floor_1_2025-05-05_run_1, with the front camera view for context.

Recovered trajectory vs. ground truth — sequence floor_1_2025-05-05_run_1 (front camera).

Takeaways

  • 360° rig + LoMa + MegaLoc provide matches and loops that classical features miss on construction-site textures.
  • Tightly-coupled visual-inertial bundle adjustment (closed-form initialization + IMU preintegration) recovers metric scale and gravity.
  • Wall segmentation + 2D ICP closes the residual yaw / translation gap to the floorplan.