InstantSplat: Unbounded Sparse-view Pose-free Gaussian Splatting in 40 Seconds

The results shown are derived from 3 training views. Our method estimates poses, which are then interpolated to render the video.
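
The interpolation step mentioned above can be illustrated with a minimal sketch. It assumes each estimated pose is stored as a unit quaternion (w, x, y, z) plus a camera position, and it blends rotations with spherical linear interpolation and positions linearly. The function names `slerp`, `interpolate_pose`, and `interpolate_path` are hypothetical and are not taken from the InstantSplat code; the actual interpolation scheme used for the video may differ.

```python
import math

def slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions (w, x, y, z)."""
    dot = sum(a * b for a, b in zip(q0, q1))
    # q and -q encode the same rotation; flip one to follow the shorter arc.
    if dot < 0.0:
        q1 = tuple(-c for c in q1)
        dot = -dot
    if dot > 0.9995:
        # Nearly identical rotations: linear blend avoids dividing by sin(theta) ~ 0.
        q = tuple(a + t * (b - a) for a, b in zip(q0, q1))
    else:
        theta = math.acos(dot)
        s0 = math.sin((1.0 - t) * theta) / math.sin(theta)
        s1 = math.sin(t * theta) / math.sin(theta)
        q = tuple(s0 * a + s1 * b for a, b in zip(q0, q1))
    # Renormalize so the result stays a valid rotation.
    norm = math.sqrt(sum(c * c for c in q))
    return tuple(c / norm for c in q)

def interpolate_pose(pose0, pose1, t):
    """Blend two (quaternion, position) poses at fraction t in [0, 1]."""
    (q0, p0), (q1, p1) = pose0, pose1
    position = tuple(a + t * (b - a) for a, b in zip(p0, p1))
    return slerp(q0, q1, t), position

def interpolate_path(poses, steps_per_segment):
    """Densify a list of keyframe poses into a rendering path."""
    path = []
    for pose0, pose1 in zip(poses, poses[1:]):
        for i in range(steps_per_segment):
            path.append(interpolate_pose(pose0, pose1, i / steps_per_segment))
    path.append(poses[-1])  # include the final keyframe exactly once
    return path
```

With 3 estimated training poses and, say, 30 steps per segment, `interpolate_path` would return 61 poses, each of which could be handed to a renderer to produce one video frame.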

Overview

While novel view synthesis (NVS) has made substantial progress in 3D computer vision, it typically requires an initial estimation of camera intrinsics and extrinsics from dense viewpoints. This pre-processing is usually conducted via a Structure-from-Motion (SfM) pipeline, a procedure that can be slow and unreliable, particularly in sparse-view scenarios where too few features are matched for accurate reconstruction. In this work, we integrate the strengths of point-based representations (e.g., 3D Gaussian Splatting, 3D-GS) with end-to-end dense stereo models (DUSt3R) to tackle the complex yet unresolved issues in NVS under unconstrained settings, which encompass pose-free and sparse-view challenges.

Our framework, InstantSplat, unifies dense stereo priors with 3D-GS to build 3D Gaussians of large-scale scenes from sparse-view, pose-free images in less than 1 minute. Specifically, InstantSplat comprises a Coarse Geometric Initialization (CGI) module that swiftly establishes a preliminary scene structure and camera parameters across all training views, utilizing globally aligned 3D point maps derived from a pre-trained dense stereo pipeline. This is followed by the Fast 3D-Gaussian Optimization (F-3DGO) module, which jointly optimizes the 3D Gaussian attributes and the initialized poses with pose regularization. Experiments conducted on the large-scale outdoor Tanks & Temples dataset demonstrate that InstantSplat significantly improves SSIM (by 32%) while concurrently reducing Absolute Trajectory Error (ATE) by 80%. These results establish InstantSplat as a viable solution for scenarios involving pose-free and sparse-view conditions.
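
To make the pose metric concrete, the following is a minimal sketch of the last step of an ATE computation: the root-mean-square distance between estimated and ground-truth camera positions. It assumes the estimated trajectory has already been aligned to the ground truth (ATE is normally reported after such an alignment, which is not shown here). The helper names `ate_rmse` and `relative_reduction` are hypothetical, and reading "reducing ATE by 80%" as a relative reduction against a baseline is our assumption.

```python
import math

def ate_rmse(estimated, ground_truth):
    """RMSE of camera-position error between two pre-aligned trajectories.

    Both arguments are equal-length sequences of (x, y, z) camera centers.
    Assumption: the alignment to the ground-truth frame was done beforehand.
    """
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("trajectories must be non-empty and of equal length")
    squared_errors = [
        sum((e - g) ** 2 for e, g in zip(est, gt))
        for est, gt in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def relative_reduction(baseline_error, new_error):
    """Fraction by which new_error is lower than baseline_error."""
    return (baseline_error - new_error) / baseline_error
```

For example, `relative_reduction(0.05, 0.01)` evaluates to roughly 0.8, i.e. an 80% reduction; these two input values are purely illustrative and are not results from the paper.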

Video

Method Overview

Overall Framework of InstantSplat. Starting with sparse, unposed images, the Coarse Geometric Initialization (left) rapidly predicts globally aligned point clouds and initializes poses (20.6 seconds). Then the Fast 3D-Gaussian Optimization (right) leverages this initialization to conduct streamlined optimization of the 3D Gaussians and camera parameters (16.67 seconds).

Results

Results on the Tanks and Temples Benchmark

Visual Comparisons

Side-by-side comparisons: Ours vs. CF-3DGS [Fu 2023]; Ours vs. NoPe-NeRF [Bian 2022]; Ours vs. NeRFmm [Wang 2021].

BibTeX

     
@misc{fan2024instantsplat,
      title={InstantSplat: Unbounded Sparse-view Pose-free Gaussian Splatting in 40 Seconds},
      eprint={2403.20309},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}