InstantSplat: Unbounded Sparse-view Pose-free Gaussian Splatting in 10 Seconds

1 The University of Texas at Austin   2 Nvidia   3 Xiamen University
4 Georgia Institute of Technology   5 Stanford University   6 University of Southern California
* denotes equal contribution

The results shown are derived from 3 training views. Our method estimates the camera poses, which are then interpolated to render the video.

Overview

While novel view synthesis (NVS) has made substantial progress in 3D computer vision, it requires accurate estimation of camera intrinsics and extrinsics from dense viewpoints. This pre-processing is usually conducted via a Structure-from-Motion (SfM) pipeline, a procedure that can be slow and unreliable, particularly in sparse-view scenarios with insufficient matched features for accurate reconstruction. In this work, we integrate the strengths of point-based representations with end-to-end dense stereo models to address the complex yet unresolved issues in NVS under unconstrained settings, which encompass the pose-free and sparse-view challenges.

Our framework, InstantSplat, unifies dense stereo priors and Gaussian Splatting to build 3D Gaussians of large-scale scenes in less than one minute. Specifically, InstantSplat establishes a coarse but dense scene structure across all training views and resolves the initial camera parameters via fast solvers. However, the initialized dense points lead to an excessive number of Gaussians, a sub-optimal scene representation, and degraded pose accuracy, resulting in inferior rendering quality. We first propose KNN-based point downsampling guided by the scene scale and view count, and then leverage a joint optimization framework to refine the 3D Gaussians and camera parameters, incorporating confidence-aware view regularization to ensure the optimization neither diverges nor converges to an erroneous solution. We achieve a significant reduction in training time, from hours to seconds, by presenting the first framework that starts from noisy yet well-covered input and applies a rapid joint optimization. We demonstrate that InstantSplat performs robustly across various numbers of views on benchmarks and achieves high-resolution, photo-realistic rendering of scenes collected from the Perseverance rover using its stereo navigation cameras.
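To make the downsampling step concrete, here is a minimal sketch assuming NumPy/SciPy; the function name knn_downsample, the per-view point budget, and the density heuristic are illustrative assumptions, not the paper's exact rule.

    import numpy as np
    from scipy.spatial import cKDTree

    def knn_downsample(points, num_views, k=4, pts_per_view=50_000):
        """Thin a dense point cloud to roughly pts_per_view * num_views points.

        points: (N, 3) array of globally aligned 3D points.
        """
        # Scene scale: diagonal of the axis-aligned bounding box.
        scale = np.linalg.norm(points.max(0) - points.min(0))

        # Mean distance to the k nearest neighbors as a local density score
        # (small distance = redundant, over-sampled region), normalized by
        # the scene scale so the threshold is scale-invariant.
        tree = cKDTree(points)
        dists, _ = tree.query(points, k=k + 1)   # first neighbor is the point itself
        density = dists[:, 1:].mean(axis=1) / scale

        # Keep the sparsest points first so dense, redundant regions are thinned.
        budget = min(len(points), pts_per_view * num_views)
        keep = np.argsort(-density)[:budget]
        return points[keep]

The intent is that the Gaussian budget scales with how many views must be explained, while the scale-normalized KNN score flags redundant points in over-sampled regions.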

Video

Method Overview

An illustrative overview of our method is shown above. We introduce a new pipeline that incorporates DUSt3R as a 3D prior model, providing globally aligned initial scene geometry for the 3D Gaussians. This allows the camera poses and intrinsics to be computed from the dense point maps and then jointly optimized with all other 3D Gaussian attributes. The supervision signal is backpropagated from the photometric discrepancies between the images rendered via splatting and the ground-truth images.
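The joint optimization can be pictured with the following PyTorch sketch, assuming the Gaussians live in an nn.Module and a differentiable splatting rasterizer render_fn is supplied by the caller; the learning rates and the simple confidence-weighted L1 loss are illustrative assumptions, not the released implementation.

    import torch

    def joint_refine(render_fn, gaussians, poses, intrinsics, images, confs,
                     iters=1000, lr_pose=1e-4, lr_gs=1e-3):
        # Camera poses become free variables alongside the Gaussian attributes.
        poses = [p.clone().detach().requires_grad_(True) for p in poses]
        opt = torch.optim.Adam([
            {"params": gaussians.parameters(), "lr": lr_gs},
            {"params": poses, "lr": lr_pose},
        ])
        for _ in range(iters):
            loss = 0.0
            for pose, K, gt, conf in zip(poses, intrinsics, images, confs):
                pred = render_fn(gaussians, pose, K)  # (3, H, W) splatted image
                # Confidence-aware photometric term: pixels where the stereo
                # prior is uncertain contribute less, keeping the joint
                # optimization from drifting to an erroneous solution.
                loss = loss + (conf * (pred - gt).abs()).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
        return gaussians, poses

Because the scene structure from the stereo prior is already dense and well-covered, a short run of this loop suffices, which is where the large training-time reduction comes from.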

Results

Results on the Tanks and Temples Benchmark

We study the impact of the number of training views on rendering quality, comparing our model with CF-3DGS on the Tanks and Temples dataset.

Visual Comparisons

Side-by-side renderings of our method against the baselines: Ours vs. CF-3DGS [Fu 2023], Ours vs. NoPe-NeRF [Bian 2022], and Ours vs. NeRFmm [Wang 2021].

BibTeX

@misc{fan2024instantsplat,
  title={InstantSplat: Unbounded Sparse-view Pose-free Gaussian Splatting in 40 Seconds},
  author={Zhiwen Fan and Wenyan Cong and Kairun Wen and Kevin Wang and Jian Zhang and Xinghao Ding and Danfei Xu and Boris Ivanovic and Marco Pavone and Georgios Pavlakos and Zhangyang Wang and Yue Wang},
  year={2024},
  eprint={2403.20309},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}