The results shown are derived from 3 training views . Our method estimates poses, which are then interpolated to render the video.
While novel view synthesis (NVS) has made substantial progress in 3D computer vision, it typically requires an initial estimation of camera intrinsics and extrinsics from dense viewpoints. This pre-processing is usually conducted via a Structure-from-Motion (SfM) pipeline, a procedure that can be slow and unreliable, particularly in sparse-view scenarios with insufficient matched features for accurate reconstruction. In this work, we integrate the strengths of point-based representations (e.g., 3D Gaussian Splatting, 3D-GS) with end-to-end dense stereo models (DUSt3R) to tackle the complex yet unresolved issues in NVS under unconstrained settings, which encompasses pose-free & sparse view challenges.
Our framework, InstantSplat, unifies dense stereo priors with 3D-GS to build 3D Gaussians of large-scale scenes from sparse-view & pose-free images in less than 1 minute. Specifically, InstantSplat comprises a Coarse Geometric Initialization (CGI) module that swiftly establishes a preliminary scene structure and camera parameters across all training views, utilizing globally-aligned 3D point maps derived from a pre-trained dense stereo pipeline. This is followed by the Fast 3D-Gaussian Optimization (F-3DGO) module, which jointly optimizes the 3D Gaussian attributes and the initialized poses with pose regularization. Experiments conducted on the large-scale outdoor Tanks \& Temples datasets demonstrate that InstantSplat significantly improves SSIM (by 32%) while concurrently reducing Absolute Trajectory Error (ATE) by 80%. These establish InstantSplat as a viable solution for scenarios involving pose-free and sparse-view conditions.
An illustrative overview of our method above, we introduce a new pipeline that incorporates DUSt3R as a 3D prior model, providing globally aligned initial scene geometry for 3D Gaussians. This allows for the subsequent calculation of camera poses and intrinsics from the dense point maps, which are then jointly optimized with all other 3D Gaussian attributes. The supervision signals are derived backward from the photometric discrepancies between the rendered images via splatting and the ground-truth images.
Result on Tanks and Temples Benchmark
We study the impact of the training view number to the rendering quality. Comparisons are between our model and CF-3DGS, on the Tanks and Temples dataset
@misc{fan2024instantsplat,
title={InstantSplat: Unbounded Sparse-view Pose-free Gaussian Splatting in 40 Seconds},
author={Zhiwen Fan and Wenyan Cong and Kairun Wen and Kevin Wang and Jian Zhang and Xinghao Ding and Danfei Xu and Boris Ivanovic and Marco Pavone and Georgios Pavlakos and Zhangyang Wang and Yue Wang},
year={2024},
eprint={2403.20309},
archivePrefix={arXiv},
primaryClass={cs.CV}
}