CUA-O3D:

Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D Scene Understanding

¹University of Trento, ²Fondazione Bruno Kessler.


Top: feature distribution analysis of the 2D projected feature embeddings from different foundation models (LSeg, DINOv2, and Stable Diffusion), computed over the entire ScanNetV2 train set by counting the frequency of point features from each 2D model falling into each bin interval. Bottom: an example in which K-Means clusters the projected 3D features into a fixed number of clusters for segmentation comparison. The different foundation models yield heterogeneous yet complementary results.
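For reference, a minimal Python sketch of this analysis is shown below, assuming the per-point projected features of each foundation model are stored as NumPy arrays of shape (N, C); the file names, bin range, and cluster count are illustrative rather than taken from the paper.

    # Minimal sketch of the figure's analysis; paths, bins, and cluster count are assumptions.
    import numpy as np
    from sklearn.cluster import KMeans

    features = {
        "LSeg": np.load("lseg_projected_feats.npy"),            # hypothetical file
        "DINOv2": np.load("dinov2_projected_feats.npy"),        # hypothetical file
        "StableDiffusion": np.load("sd_projected_feats.npy"),   # hypothetical file
    }

    # Top panel: histogram of feature values per model over a shared set of bin intervals.
    bins = np.linspace(-1.0, 1.0, 51)
    for name, feats in features.items():
        counts, _ = np.histogram(feats.ravel(), bins=bins)
        print(name, counts)

    # Bottom panel: K-Means pseudo-segmentation of one model's projected 3D features.
    cluster_ids = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(features["LSeg"])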

Abstract

The lack of a large-scale 3D-text corpus has led recent works to distill open-vocabulary knowledge from vision-language models (VLMs). However, these methods typically rely on a single VLM to align the feature spaces of 3D models within a common language space, which limits the potential of 3D models to leverage the diverse spatial and semantic capabilities encapsulated in various foundation models. In this paper, we propose Cross-modal and Uncertainty-aware Agglomeration for Open-vocabulary 3D Scene Understanding, dubbed CUA-O3D, the first model to integrate multiple foundation models, such as CLIP, DINOv2, and Stable Diffusion, into 3D scene understanding. We further introduce deterministic uncertainty estimation to adaptively distill and harmonize the heterogeneous 2D feature embeddings from these models. Our method addresses two key challenges: (1) incorporating semantic priors from VLMs alongside the geometric knowledge of spatially-aware vision foundation models, and (2) using a novel deterministic uncertainty estimation to capture model-specific uncertainties across diverse semantic and geometric sensitivities, helping to reconcile heterogeneous representations during training. Extensive experiments on ScanNetV2 and Matterport3D demonstrate that our method not only advances open-vocabulary segmentation but also achieves robust cross-domain alignment and competitive spatial perception capabilities.

Overview of CUA-O3D


We first use LSeg, DINOv2, and Stable Diffusion to extract embeddings from multi-view posed images, and then apply multi-view 3D projection to obtain the projected 3D features \( F^{2D}_i \) that supervise the training of the 3D model. Three MLP heads map the 3D backbone features to each 2D model's supervision independently, while a model-specific noise scalar \( \sigma_i \), predicted via deterministic uncertainty estimation, is learned and used to adaptively weight the corresponding distillation loss \( L \).
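As an illustration of this step, below is a minimal PyTorch sketch of the per-teacher MLP heads and an uncertainty-weighted distillation objective. The head architecture, feature dimensions, cosine distillation loss, and the specific log-variance weighting are illustrative assumptions and may differ from the deterministic uncertainty estimation described in the paper.

    # Minimal PyTorch sketch; dimensions, head design, and loss form are assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DistillHeads(nn.Module):
        def __init__(self, in_dim, out_dims):
            # out_dims, e.g. {"lseg": 512, "dinov2": 768, "sd": 1280} (hypothetical sizes)
            super().__init__()
            self.heads = nn.ModuleDict({
                name: nn.Sequential(nn.Linear(in_dim, in_dim), nn.ReLU(), nn.Linear(in_dim, d))
                for name, d in out_dims.items()
            })
            # One learnable noise scalar (log-variance) per 2D teacher.
            self.log_var = nn.ParameterDict({
                name: nn.Parameter(torch.zeros(1)) for name in out_dims
            })

        def forward(self, feat3d, targets):
            # feat3d: (N, in_dim) 3D backbone features; targets: dict of (N, d_i) projected 2D features.
            total = 0.0
            for name, head in self.heads.items():
                pred = head(feat3d)
                # Cosine distillation against the projected 2D features F^{2D}_i.
                loss_i = 1.0 - F.cosine_similarity(pred, targets[name], dim=-1).mean()
                s = self.log_var[name]
                # Uncertainty-aware weighting: noisier teachers are down-weighted, sigma is regularized.
                total = total + torch.exp(-s) * loss_i + 0.5 * s
            return total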


General Performance & Generalization Ability

Open-vocabulary 3D semantic segmentation results. We compare our CUA-O3D with recent fully supervised (Fully-sup.) and zero-shot (Zero-shot) baselines. Our method demonstrates competitive performance on both ScanNetV2 and Matterport3D. † denotes results from the original paper based on LSeg.


Cross-dataset evaluation. We evaluate the cross-dataset generalization capability of CUA-O3D by training on ScanNetV2 and evaluating on Matterport3D (ScanNetV2 → Matterport3D), and vice versa.


Comparison on cross-dataset generalization. Both CUA-O3D and OpenScene are trained on ScanNet and zero-shot tested on the Matterport3D dataset. ‡ denotes pure 3D results obtained from the officially released model. K = 21 corresponds to the original Matterport3D benchmark, while K = 40, 80, 160 use the K most common categories from the NYU label set provided with the benchmark.


Linear probing results on the ScanNetV2 and Matterport3D validation sets. Upperbound-full sup. denotes the fully-supervised upper bound, while Baseline init. means the model is initialized from our baseline and then evaluated with linear probing.
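For reference, a minimal sketch of the linear probing protocol is given below, assuming a frozen 3D backbone that outputs per-point features and a dataloader yielding per-point semantic labels; the interface and hyperparameters are illustrative.

    # Minimal linear probing sketch; backbone/loader interfaces and hyperparameters are assumptions.
    import torch
    import torch.nn as nn

    def linear_probe(backbone, loader, feat_dim, num_classes, epochs=20, lr=1e-3, device="cuda"):
        backbone.eval()                           # keep the distilled 3D features frozen
        for p in backbone.parameters():
            p.requires_grad_(False)

        probe = nn.Linear(feat_dim, num_classes).to(device)
        opt = torch.optim.Adam(probe.parameters(), lr=lr)
        ce = nn.CrossEntropyLoss(ignore_index=-100)   # ignore unlabeled points

        for _ in range(epochs):
            for points, labels in loader:         # points: (N, 3 + c), labels: (N,)
                points, labels = points.to(device), labels.to(device)
                with torch.no_grad():
                    feats = backbone(points)      # (N, feat_dim) per-point features
                logits = probe(feats)
                loss = ce(logits, labels)
                opt.zero_grad()
                loss.backward()
                opt.step()
        return probe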


Visualizations

Left: K-Means is used to cluster the projected 3D feature embeddings from LSeg, DINOv2, and Stable Diffusion, as well as the final distilled features predicted by our 3D model. Right: UMAP projects the high-dimensional features into a low-dimensional space to visualize their structural characteristics. The white rectangles highlight the clearly heterogeneous yet complementary results.
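The right-hand projection can be reproduced with a short sketch like the one below, assuming (N, C) per-point features and the umap-learn package; the K-Means clustering on the left follows the same recipe as the earlier sketch.

    # Minimal UMAP sketch; settings are illustrative.
    import umap

    def project_2d(feats, seed=0):
        # feats: (N, C) projected or distilled 3D point features -> (N, 2) embedding for plotting
        return umap.UMAP(n_components=2, random_state=seed).fit_transform(feats)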
Open-vocabulary semantic segmentation comparisons on ScanNetV2 and Matterport3D. Our approach shows superior performance over OpenScene, which we regard as our baseline. Best viewed zoomed in. Red denotes significant improvements.

BibTeX

@article{li2025cross,
  title={Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D Scene Understanding},
  author={Li, Jinlong and Saltori, Cristiano and Poiesi, Fabio and Sebe, Nicu},
  journal={arXiv preprint arXiv:2503.16707},
  year={2025}
}