InstantSplat: Unbounded Sparse-view Pose-free Gaussian Splatting in 40 Seconds

1 The University of Texas at Austin   2 Nvidia   3 Xiamen University
4 Georgia Institute of Technology   5 Stanford University   6 University of Southern California
* denotes equal contribution

The results shown are derived from 3 training views. Our method estimates the camera poses, which are then interpolated to render the video.
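
The page does not specify the interpolation scheme. A common choice, sketched below, is spherical linear interpolation (slerp) of the estimated rotations combined with linear interpolation of the translations; the function interpolate_poses and the 4x4 camera-to-world convention are illustrative assumptions, not the authors' implementation.

import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def interpolate_poses(poses, n_frames):
    """Interpolate a short list of 4x4 camera-to-world poses into a smooth
    trajectory of n_frames poses (slerp for rotations, lerp for translations).
    Illustrative sketch only, not the authors' exact code."""
    key_times = np.linspace(0.0, 1.0, len(poses))
    query_times = np.linspace(0.0, 1.0, n_frames)

    # Rotations: spherical linear interpolation between the keyframes.
    rotations = Rotation.from_matrix(np.stack([p[:3, :3] for p in poses]))
    interp_rots = Slerp(key_times, rotations)(query_times).as_matrix()

    # Translations: per-axis linear interpolation.
    trans = np.stack([p[:3, 3] for p in poses])
    interp_trans = np.stack(
        [np.interp(query_times, key_times, trans[:, i]) for i in range(3)],
        axis=1,
    )

    out = np.tile(np.eye(4), (n_frames, 1, 1))
    out[:, :3, :3] = interp_rots
    out[:, :3, 3] = interp_trans
    return out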

Overview

While novel view synthesis (NVS) has made substantial progress in 3D computer vision, it typically requires an initial estimation of camera intrinsics and extrinsics from dense viewpoints. This pre-processing is usually conducted via a Structure-from-Motion (SfM) pipeline, a procedure that can be slow and unreliable, particularly in sparse-view scenarios with insufficient matched features for accurate reconstruction. In this work, we integrate the strengths of point-based representations (e.g., 3D Gaussian Splatting, 3D-GS) with end-to-end dense stereo models (DUSt3R) to tackle the complex yet unresolved issues in NVS under unconstrained settings, which encompass the pose-free and sparse-view challenges.

Our framework, InstantSplat, unifies dense stereo priors with 3D-GS to build 3D Gaussians of large-scale scenes from sparse-view, pose-free images in less than 1 minute. Specifically, InstantSplat comprises a Coarse Geometric Initialization (CGI) module that swiftly establishes a preliminary scene structure and camera parameters across all training views, utilizing globally-aligned 3D point maps derived from a pre-trained dense stereo pipeline. This is followed by the Fast 3D-Gaussian Optimization (F-3DGO) module, which jointly optimizes the 3D Gaussian attributes and the initialized poses with pose regularization. Experiments conducted on the large-scale outdoor Tanks and Temples dataset demonstrate that InstantSplat significantly improves SSIM (by 32%) while concurrently reducing the Absolute Trajectory Error (ATE) by 80%. These results establish InstantSplat as a viable solution for scenarios involving pose-free and sparse-view conditions.
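
The page only names the two modules, so the sketch below is a rough PyTorch illustration of what F-3DGO's joint optimization of Gaussian attributes and poses with pose regularization could look like. Everything here is an assumption made for illustration: render stands in for a differentiable Gaussian-splatting rasterizer, gaussians for a dict of learnable attribute tensors produced by CGI, and init_poses, K, images, num_steps, and the regularization weight are placeholders rather than the authors' actual code.

import torch

def apply_pose_delta(pose, omega, tau):
    # Perturb a 4x4 camera pose by a small rotation (first-order update
    # I + [omega]_x from an axis-angle vector omega) and a translation tau.
    zero = torch.zeros((), device=omega.device)
    skew = torch.stack([
        torch.stack([zero, -omega[2], omega[1]]),
        torch.stack([omega[2], zero, -omega[0]]),
        torch.stack([-omega[1], omega[0], zero]),
    ])
    new = pose.clone()
    new[:3, :3] = (torch.eye(3, device=omega.device) + skew) @ pose[:3, :3]
    new[:3, 3] = pose[:3, 3] + tau
    return new

# Learnable per-view pose corrections, initialized to zero so that
# optimization starts from the CGI-estimated poses.
n_views = init_poses.shape[0]
omegas = torch.zeros(n_views, 3, requires_grad=True)
taus = torch.zeros(n_views, 3, requires_grad=True)

optimizer = torch.optim.Adam(list(gaussians.values()) + [omegas, taus], lr=1e-3)
lambda_reg = 0.01  # pose-regularization weight (illustrative value)

for step in range(num_steps):
    i = step % n_views
    pose = apply_pose_delta(init_poses[i], omegas[i], taus[i])
    rendered = render(gaussians, pose, K)              # placeholder rasterizer
    photometric = (rendered - images[i]).abs().mean()  # L1 photometric loss
    # Penalize drift away from the coarse initialization.
    pose_reg = omegas[i].square().sum() + taus[i].square().sum()
    loss = photometric + lambda_reg * pose_reg
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()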

Video

Method Overview

The figure above gives an illustrative overview of our method. We introduce a new pipeline that incorporates DUSt3R as a 3D prior model, providing globally aligned initial scene geometry for the 3D Gaussians. Camera poses and intrinsics are subsequently calculated from the dense point maps and then jointly optimized with all other 3D Gaussian attributes. The supervision signal is backpropagated from the photometric discrepancy between the images rendered via splatting and the ground-truth images.
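
The overview says that intrinsics and poses are calculated from the dense point maps but does not detail how. As one minimal sketch, assuming a pinhole camera with square pixels and the principal point at the image center (not necessarily the authors' exact estimator), the focal length can be recovered from a per-pixel point map in the camera frame by closed-form least squares:

import numpy as np

def estimate_focal(pointmap):
    """Least-squares focal length from an (H, W, 3) point map in the camera
    frame, assuming a centered principal point and square pixels. Solves
    min_f sum || (u - cx, v - cy) - f * (X/Z, Y/Z) ||^2 in closed form."""
    H, W, _ = pointmap.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    cx, cy = (W - 1) / 2.0, (H - 1) / 2.0

    X, Y, Z = pointmap[..., 0], pointmap[..., 1], pointmap[..., 2]
    valid = Z > 1e-6                     # drop points at or behind the camera
    x, y = X[valid] / Z[valid], Y[valid] / Z[valid]
    du, dv = u[valid] - cx, v[valid] - cy

    # Setting the derivative of the quadratic objective to zero gives f.
    return float((du * x + dv * y).sum() / (x * x + y * y).sum())

With the intrinsics fixed, per-view extrinsics can then be obtained, for example by aligning each view's point map to the globally aligned scene points; the page states only that poses follow from the point maps, so this step is likewise an assumption.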

Results

Results on the Tanks and Temples Benchmark

We study the impact of the number of training views on rendering quality, comparing our model against CF-3DGS on the Tanks and Temples dataset.

Visual Comparisons

Interactive side-by-side comparisons (image sliders) of our renderings against CF-3DGS [Fu 2023], NoPe-NeRF [Bian 2022], and NeRFmm [Wang 2021].

BibTeX

@misc{fan2024instantsplat,
  title={InstantSplat: Unbounded Sparse-view Pose-free Gaussian Splatting in 40 Seconds},
  author={Zhiwen Fan and Wenyan Cong and Kairun Wen and Kevin Wang and Jian Zhang and Xinghao Ding and Danfei Xu and Boris Ivanovic and Marco Pavone and Georgios Pavlakos and Zhangyang Wang and Yue Wang},
  year={2024},
  eprint={2403.20309},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}