The results shown are reconstructed from 3 training views. Our method estimates the camera poses, which are then interpolated to render the video.
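The page does not say how the estimated poses are interpolated. A common choice, assumed here purely for illustration, is spherical linear interpolation (slerp) of the rotations combined with linear interpolation of the camera positions. In the minimal sketch below, the pose layout (a unit quaternion `(w, x, y, z)` plus a position tuple) and the names `slerp`, `interpolate_pose`, and `camera_path` are hypothetical and are not taken from the InstantSplat code.

```python
import math

def slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions (w, x, y, z)."""
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:
        # q and -q encode the same rotation; flip one to follow the shorter arc.
        q1 = tuple(-c for c in q1)
        dot = -dot
    if dot > 0.9995:
        # Nearly identical rotations: linear blend avoids dividing by sin(theta) ~ 0.
        q = tuple(a + t * (b - a) for a, b in zip(q0, q1))
    else:
        theta = math.acos(dot)
        s0 = math.sin((1.0 - t) * theta) / math.sin(theta)
        s1 = math.sin(t * theta) / math.sin(theta)
        q = tuple(s0 * a + s1 * b for a, b in zip(q0, q1))
    norm = math.sqrt(sum(c * c for c in q))
    return tuple(c / norm for c in q)

def interpolate_pose(pose0, pose1, t):
    """Blend two poses given as (quaternion, position): slerp plus linear blend."""
    (q0, p0), (q1, p1) = pose0, pose1
    position = tuple(a + t * (b - a) for a, b in zip(p0, p1))
    return slerp(q0, q1, t), position

def camera_path(poses, frames_per_segment):
    """Densify a list of keyframe poses into a camera path for video rendering."""
    path = []
    for pose0, pose1 in zip(poses, poses[1:]):
        for i in range(frames_per_segment):
            path.append(interpolate_pose(pose0, pose1, i / frames_per_segment))
    path.append(poses[-1])  # finish exactly on the last keyframe
    return path

# Three made-up keyframes standing in for the poses estimated from 3 training views.
keyframes = [
    ((1.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0)),
    ((math.cos(0.2), 0.0, math.sin(0.2), 0.0), (1.0, 0.0, 0.2)),
    ((math.cos(0.4), 0.0, math.sin(0.4), 0.0), (2.0, 0.0, 0.8)),
]
path = camera_path(keyframes, frames_per_segment=10)
```

Each entry of `path` would then be handed to the renderer as one video frame. Piecewise-linear positions produce visible corners at the keyframes, so a spline through the positions is a reasonable alternative when smoother motion matters.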
While neural 3D reconstruction has advanced substantially, it typically requires densely captured multi-view data with carefully initialized poses (e.g., using COLMAP). This requirement limits its broader applicability, as Structure-from-Motion (SfM) is often unreliable in sparse-view scenarios, where feature matches are limited and errors accumulate. In this paper, we introduce InstantSplat, a novel and lightning-fast neural reconstruction system that builds accurate 3D representations from as few as 2-3 images. InstantSplat adopts a self-supervised framework that bridges the gap between 2D images and 3D representations using Gaussian Bundle Adjustment (GauBA) and can be optimized in an end-to-end manner. InstantSplat integrates dense stereo priors and co-visibility relationships between frames to initialize pixel-aligned geometry by progressively expanding the scene while avoiding redundancy. Gaussian Bundle Adjustment then quickly adapts both the scene representation and the camera parameters through gradient-based minimization of the photometric error. Overall, InstantSplat achieves large-scale 3D reconstruction in mere seconds by reducing the required number of input views, and is compatible with multiple 3D representations (3D-GS, Mip-Splatting). Compared with COLMAP with 3D-GS, it accelerates reconstruction by more than 20 times and improves visual quality (SSIM) from 0.3755 to 0.7624.
Overall Framework of InstantSplat. The modular COLMAP pipeline with Gaussian Splatting relies on the time-consuming and accuracy-sensitive adaptive density control (ADC) process within 3D-GS, as well as on accurate camera poses and sparse point clouds from SfM. In contrast, InstantSplat employs a deep model to initialize dense surface points. It adopts an end-to-end framework that iteratively optimizes both the 3D representation and the camera poses, enhancing efficiency and robustness.
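The idea of jointly optimizing the scene and the camera poses against a photometric loss can be illustrated with a deliberately tiny 1D analogue. This is not the authors' Gaussian Bundle Adjustment: the "scene" is three Gaussian blobs on a line, each "camera pose" is a single translation, the gradients are derived by hand, and every name and constant (`render`, `joint_step`, `SIGMA`, the learning rate) is an assumption made for this sketch. The first camera is held fixed to remove the ambiguity between moving the scene and moving all cameras, and the blob centers start near the truth, loosely mirroring initialization from a geometric prior.

```python
import math

SIGMA = 0.5  # shared blob width, kept fixed in this toy (assumption)

def render(centers, weights, shift, xs):
    """Render a 1D 'image': Gaussian blobs seen by a camera translated by `shift`."""
    return [
        sum(w * math.exp(-0.5 * ((x - (c - shift)) / SIGMA) ** 2)
            for c, w in zip(centers, weights))
        for x in xs
    ]

def joint_step(centers, shifts, weights, targets, xs, lr):
    """One gradient-descent step on blob centers and camera shifts together."""
    n = len(targets) * len(xs)
    g_centers = [0.0] * len(centers)
    g_shifts = [0.0] * len(shifts)
    loss = 0.0
    for k, target in enumerate(targets):
        for j, x in enumerate(xs):
            ds = [x - (c - shifts[k]) for c in centers]
            blobs = [w * math.exp(-0.5 * (d / SIGMA) ** 2)
                     for w, d in zip(weights, ds)]
            r = sum(blobs) - target[j]            # photometric residual at this pixel
            loss += r * r / n
            for i, (b, d) in enumerate(zip(blobs, ds)):
                slope = b * d / SIGMA ** 2        # d(pixel) / d(center_i)
                g_centers[i] += 2.0 * r * slope / n
                g_shifts[k] -= 2.0 * r * slope / n  # d(pixel) / d(shift_k) = -slope
    g_shifts[0] = 0.0  # camera 0 is the reference frame and never moves
    new_centers = [c - lr * g for c, g in zip(centers, g_centers)]
    new_shifts = [s - lr * g for s, g in zip(shifts, g_shifts)]
    return new_centers, new_shifts, loss

# Synthetic ground truth: 3 blobs observed by 3 cameras.
xs = [10.0 * j / 59 for j in range(60)]
weights = [1.0, 0.8, 0.6]
true_centers = [3.0, 5.0, 7.0]
true_shifts = [0.0, 0.3, -0.2]
targets = [render(true_centers, weights, s, xs) for s in true_shifts]

# Perturbed scene initialization and unknown (zero) camera shifts.
centers = [3.2, 4.85, 7.1]
shifts = [0.0, 0.0, 0.0]
losses = []
for _ in range(1000):
    centers, shifts, loss = joint_step(centers, shifts, weights, targets, xs, lr=0.5)
    losses.append(loss)
```

After the loop, `centers` and `shifts` should sit close to `true_centers` and `true_shifts`, and `losses` should have fallen well below its starting value. A real system would render 3D Gaussians through a differentiable rasterizer, optimize full 6-DoF poses, and rely on automatic differentiation rather than hand-written gradients.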
InstantSplat processes sparse-view, unposed images to reconstruct a radiance field, capturing detailed scenes rapidly without relying on SfM. Optimization is self-supervised and supported by the large pretrained model MASt3R.
@misc{fan2024instantsplat,
    title={InstantSplat: Unbounded Sparse-view Pose-free Gaussian Splatting in 40 Seconds},
    author={Zhiwen Fan and Wenyan Cong and Kairun Wen and Kevin Wang and Jian Zhang and Xinghao Ding and Danfei Xu and Boris Ivanovic and Marco Pavone and Georgios Pavlakos and Zhangyang Wang and Yue Wang},
    year={2024},
    eprint={2403.20309},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
   }