Privacy-Preserving Depth-Only Open-Vocabulary 3D Semantic Segmentation Via Uncertainty-Guided Test-Time Optimization
Pith reviewed 2026-07-02 13:36 UTC · model grok-4.3
The pith
An uncertainty-guided test-time optimization method improves depth-only open-vocabulary 3D semantic segmentation without additional training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
UTTO converts uncertainty into a guidance signal to identify unreliable semantic responses and uses semantic priors from foundation models to regularize their refinement, thereby improving depth-only open-vocabulary 3D semantic segmentation without additional training.
What carries the argument
The UTTO framework, an uncertainty-guided test-time optimization process that flags unreliable responses via uncertainty and regularizes refinement using foundation model semantic priors.
If this is right
- Depth-only inputs can support open-vocabulary 3D segmentation when uncertainty directs the refinement steps.
- Privacy is maintained by excluding RGB data while segmentation quality still rises.
- Foundation model priors can stand in for missing visual cues during test-time adaptation.
- The gains appear across different label sets including 20, 40, and 200 classes on ScanNet.
- No retraining or new annotations are needed to obtain the reported improvements.
Where Pith is reading between the lines
- The same uncertainty-to-guidance conversion could be tested on other depth-based tasks such as 3D instance segmentation under privacy constraints.
- The method hints that test-time regularization may lessen dependence on large labeled 3D datasets for new scenes.
- Similar guidance signals might help in related settings where one modality is missing, such as LiDAR-only perception.
- Evaluating the approach on non-indoor or outdoor depth data would check whether the foundation-model regularization generalizes beyond the reported benchmarks.
Load-bearing premise
Uncertainty estimates can be turned into a reliable signal for identifying which semantic responses need fixing and foundation model priors can correct them effectively when depth data supplies no appearance cues.
What would settle it
Applying UTTO to the ScanNet datasets and measuring no gain or a drop in segmentation accuracy relative to the unrefined depth-only baseline would show the claimed improvement does not hold.
Figures
read the original abstract
Privacy-preserving perception is a critical requirement for deploying 3D scene understanding systems in real-world indoor environments, yet it remains underexplored in open-vocabulary 3D semantic segmentation. Existing methods typically rely on obtaining rich semantic cues from RGB images, which may expose privacy-sensitive visual information. Depth-only 3D geometry provides a privacy-preserving alternative, but the absence of appearance-based semantic cues makes open-vocabulary predictions highly uncertain and less reliable. Under this setting, we propose to convert uncertainty into a guidance signal to identify unreliable semantic responses and use semantic priors from foundation models to regularize their refinement. We present UTTO, an uncertainty-guided test-time optimization framework for depth-only open-vocabulary 3D semantic segmentation. Without additional training, experiments on ScanNet20, ScanNet40, and ScanNet200 demonstrate that UTTO consistently improves depth-only open-vocabulary 3D segmentation and outperforms representative baselines under privacy-preserving conditions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes UTTO, an uncertainty-guided test-time optimization framework for depth-only open-vocabulary 3D semantic segmentation. It converts uncertainty signals from depth geometry into guidance to identify unreliable predictions and uses semantic priors from foundation models to regularize refinement during test-time optimization, without any additional training. Experiments on ScanNet20, ScanNet40, and ScanNet200 are claimed to show consistent improvements over representative baselines under privacy-preserving conditions.
Significance. If the central claims hold with verifiable quantitative support, the work would address an important gap in privacy-preserving 3D perception by enabling open-vocabulary segmentation from depth alone. This could have practical value for indoor scene understanding where RGB data raises privacy concerns. The use of test-time optimization with external priors is a plausible direction, but its effectiveness depends on the unverified assumptions about uncertainty guidance and regularization.
major comments (3)
- [Abstract] Abstract: The abstract states that UTTO consistently improves performance and outperforms baselines but provides no quantitative metrics, error bars, method details, or ablation studies, making it impossible to verify if the data supports the claim.
- [Method] Method (or §3): No definition is given for the uncertainty measure derived solely from depth geometry (e.g., whether it is entropy, variance, or another quantity), nor for how this signal is converted into an optimization guidance term or how the foundation-model regularization loss is constructed and balanced.
- [Experiments] Experiments: The central claim of consistent improvement on ScanNet20/40/200 requires evidence that depth-only uncertainty correctly flags semantic errors (rather than geometric ambiguity) and that the priors improve rather than hallucinate; without reported numbers, ablations, or failure cases, this load-bearing assumption remains untested.
minor comments (2)
- [Method] Notation for the uncertainty-guided loss and optimization objective should be introduced with explicit equations early in the method section for clarity.
- [Introduction] The paper should include a clear statement of the privacy threat model and confirm that no RGB or appearance data is used at any stage, including during prior extraction.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to enhance clarity, provide additional quantitative support, and strengthen the experimental evidence where possible.
read point-by-point responses
-
Referee: [Abstract] Abstract: The abstract states that UTTO consistently improves performance and outperforms baselines but provides no quantitative metrics, error bars, method details, or ablation studies, making it impossible to verify if the data supports the claim.
Authors: We agree that the abstract would benefit from quantitative support to substantiate the claims. In the revision, we will incorporate key performance metrics (e.g., mIoU improvements on the ScanNet benchmarks) along with standard deviations from repeated runs. revision: yes
-
Referee: [Method] Method (or §3): No definition is given for the uncertainty measure derived solely from depth geometry (e.g., whether it is entropy, variance, or another quantity), nor for how this signal is converted into an optimization guidance term or how the foundation-model regularization loss is constructed and balanced.
Authors: We will revise Section 3 to include an explicit definition of the uncertainty measure derived from depth geometry, along with the mathematical formulation of the guidance term and the construction/balancing of the foundation-model regularization loss, to improve reproducibility. revision: yes
-
Referee: [Experiments] Experiments: The central claim of consistent improvement on ScanNet20/40/200 requires evidence that depth-only uncertainty correctly flags semantic errors (rather than geometric ambiguity) and that the priors improve rather than hallucinate; without reported numbers, ablations, or failure cases, this load-bearing assumption remains untested.
Authors: The manuscript already reports quantitative results and comparisons on ScanNet20/40/200 in the experiments section. To further validate the assumptions, we will add ablations and analysis in the revision showing the relationship between uncertainty signals and semantic errors, as well as discussion of cases involving potential hallucination by the priors. revision: partial
Circularity Check
No circularity: framework uses external priors and test-time optimization without self-referential reduction
full rationale
The abstract and description introduce UTTO as an independent test-time optimization method that converts uncertainty into guidance and applies external foundation-model semantic priors for regularization. No equations, fitted parameters renamed as predictions, or self-citation chains are present in the provided text that would reduce any claimed result to the inputs by construction. The approach is presented as using off-the-shelf foundation models and evaluated on standard ScanNet benchmarks, making the derivation self-contained against external components.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Open- world semantic segmentation for lidar point clouds
Jun Cen, Peng Yun, Shiwei Zhang, Junhao Cai, Di Luan, Michael Yu Wang, Ming Liu, and Mingqian Tang. Open- world semantic segmentation for lidar point clouds. In Proc. of the Europ. Conf. on Computer Vision (ECCV), 2022. 3
2022
-
[2]
Tiago Cortinhal, George Tzelepis, and Eren Erdal Aksoy. Salsanext: Fast, uncertainty-aware semantic segmentation of lidar point clouds for autonomous driving.arXiv preprint arXiv:2003.03653, 2020. 3
-
[3]
Pla: Language-driven open- vocabulary 3d scene understanding
Runyu Ding, Jihan Yang, Chuhui Xue, Wenqing Zhang, Song Bai, and Xiaojuan Qi. Pla: Language-driven open- vocabulary 3d scene understanding. InProc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recogni- tion (CVPR), 2023. 3
2023
-
[4]
Scal- ing open-vocabulary image segmentation with image-level labels
Golnaz Ghiasi, Xiuye Gu, Yin Cui, and Tsung-Yi Lin. Scal- ing open-vocabulary image segmentation with image-level labels. InProc. of the Europ. Conf. on Computer Vision (ECCV), 2022. 7
2022
-
[5]
Conceptgraphs: Open-vocabulary 3d scene graphs for per- ception and planning
Qiao Gu, Ali Kuwajerwala, Sacha Morin, Krishna Murthy Jatavallabhula, Bipasha Sen, Aditya Agarwal, Corban Rivera, William Paul, Kirsty Ellis, Rama Chellappa, et al. Conceptgraphs: Open-vocabulary 3d scene graphs for per- ception and planning. InProc. of the IEEE Intl. Conf. on Robotics & Automation (ICRA), 2024. 1
2024
-
[6]
Lidal: Inter-frame uncertainty based active learning for 3d lidar semantic seg- mentation
Zeyu Hu, Xuyang Bai, Runze Zhang, Xin Wang, Guangyuan Sun, Hongbo Fu, and Chiew-Lan Tai. Lidal: Inter-frame uncertainty based active learning for 3d lidar semantic seg- mentation. InProc. of the Europ. Conf. on Computer Vision (ECCV), 2022. 3
2022
-
[7]
Designing privacy-preserving visual perception for robot navigation based on user privacy preferences
Xuying Huang, Sicong Pan, Delphine Reinhardt, and Maren Bennewitz. Designing privacy-preserving visual perception for robot navigation based on user privacy preferences. 2026. 1
2026
-
[8]
Privacy-Preserving Semantic Segmentation from Ultra-Low-Resolution RGB Inputs
Xuying Huang, Sicong Pan, Olga Zatsarynna, Juergen Gall, and Maren Bennewitz. Improved semantic seg- mentation from ultra-low-resolution rgb images applied to privacy-preserving object-goal navigation.arXiv preprint arXiv:2507.16034, 2026. 1
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[9]
Mosaic3d: Foundation dataset and model for open- vocabulary 3d segmentation
Junha Lee, Chunghyun Park, Jaesung Choe, Yu- Chiang Frank Wang, Jan Kautz, Minsu Cho, and Chris Choy. Mosaic3d: Foundation dataset and model for open- vocabulary 3d segmentation. InProc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR),
-
[10]
Cross-modal and uncertainty-aware agglomeration for open- vocabulary 3d scene understanding
Jinlong Li, Cristiano Saltori, Fabio Poiesi, and Nicu Sebe. Cross-modal and uncertainty-aware agglomeration for open- vocabulary 3d scene understanding. InProc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recogni- tion (CVPR), 2025. 3
2025
-
[11]
Mosaic: Gen- erating consistent, privacy-preserving scenes from multiple depth views in multi-room environments, 2025
Zhixuan Liu, Haokun Zhu, Rui Chen, Jonathan Francis, Soonmin Hwang, Ji Zhang, and Jean Oh. Mosaic: Gen- erating consistent, privacy-preserving scenes from multiple depth views in multi-room environments, 2025. 7
2025
-
[12]
Open3dis: Open-vocabulary 3d instance segmentation with 2d mask guidance
Phuc Nguyen, Tuan Duc Ngo, Evangelos Kalogerakis, Chuang Gan, Anh Tran, Cuong Pham, and Khoi Nguyen. Open3dis: Open-vocabulary 3d instance segmentation with 2d mask guidance. InProc. of the IEEE/CVF Conf. on Com- puter Vision and Pattern Recognition (CVPR), 2024. 3
2024
-
[13]
Openscene: 3d scene understanding with open vocabular- ies
Songyou Peng, Kyle Genova, Chiyu Max Jiang, Andrea Tagliasacchi, Marc Pollefeys, and Thomas Funkhouser. Openscene: 3d scene understanding with open vocabular- ies. InProc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2023. 1, 2, 3, 6
2023
-
[14]
Losc: Li- dar open-voc segmentation consolidator
Nermin Samet, Gilles Puy, and Renaud Marlet. Losc: Li- dar open-voc segmentation consolidator. InProc. of the Intl. Conf. on 3D Vision (3DV), 2026. 1
2026
-
[15]
Hanchen Tai, Qingdong He, Jiangning Zhang, Yijie Qian, Zhenyu Zhang, Xiaobin Hu, Xiangtai Li, Yabiao Wang, and Yong Liu. Open-vocabulary sam3d: Towards training-free open-vocabulary 3d scene understanding.arXiv preprint arXiv:2405.15580, 2024. 3
-
[16]
Sumner, Marc Pollefeys, Federico Tombari, and Francis Engelmann
Ayca Takmaz, Elisabetta Fedele, Robert W. Sumner, Marc Pollefeys, Federico Tombari, and Francis Engelmann. Open- mask3d: Open-vocabulary 3d instance segmentation. In Proc. of the Conf. on Neural Information Processing Systems (NeurIPS), 2023. 3
2023
-
[17]
Sumner, Francis Engelmann, Johanna Wald, and Federico Tombari
Ayca Takmaz, Alexandros Delitzas, Robert W. Sumner, Francis Engelmann, Johanna Wald, and Federico Tombari. Search3d: Hierarchical open-vocabulary 3d segmentation. IEEE Robotics and Automation Letters, 10(3):2558–2565,
-
[18]
Openurban3d: Label-free open-vocabulary semantic seg- mentation of large-scale urban point clouds.IEEE Transac- tions on Geoscience and Remote Sensing, 64:4501917, 2026
Chongyu Wang, Kunlei Jing, Jihua Zhu, and Di Wang. Openurban3d: Label-free open-vocabulary semantic seg- mentation of large-scale urban point clouds.IEEE Transac- tions on Geoscience and Remote Sensing, 64:4501917, 2026. 3
2026
-
[19]
Open vocabulary 3d scene un- derstanding via geometry guided self-distillation
Pengfei Wang, Yuxi Wang, Shuai Li, Zhaoxiang Zhang, Zhen Lei, and Lei Zhang. Open vocabulary 3d scene un- derstanding via geometry guided self-distillation. InProc. of the Europ. Conf. on Computer Vision (ECCV), 2024. 3
2024
-
[20]
Masked point-entity contrast for open-vocabulary 3d scene understanding
Yan Wang, Baoxiong Jia, Ziyu Zhu, and Siyuan Huang. Masked point-entity contrast for open-vocabulary 3d scene understanding. InProc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2025. 2
2025
-
[21]
Xmask3d: Cross-modal mask reasoning for open vocabulary 3d semantic segmentation
Ziyi Wang, Yanbo Wang, Xumin Yu, Jie Zhou, and Jiwen Lu. Xmask3d: Cross-modal mask reasoning for open vocabulary 3d semantic segmentation. InProc. of the Conf. on Neural Information Processing Systems (NeurIPS), 2024. 3
2024
-
[22]
Hierarchical open- vocabulary 3d scene graphs for language-grounded robot navigation
Abdelrhman Werby, Chenguang Huang, Martin B ¨uchner, Abhinav Valada, and Wolfram Burgard. Hierarchical open- vocabulary 3d scene graphs for language-grounded robot navigation. InFirst Workshop on Vision-Language Models for Navigation and Manipulation at (ICRA) 2024, 2024. 1
2024
-
[23]
Point transformer v3: Simpler, faster, stronger
Xiaoyang Wu, Li Jiang, Peng-Shuai Wang, Zhijian Liu, Xi- hui Liu, Yu Qiao, Wanli Ouyang, Tong He, and Hengshuang Zhao. Point transformer v3: Simpler, faster, stronger. In Proc. of the IEEE/CVF Conf. on Computer Vision and Pat- tern Recognition (CVPR), 2024. 8
2024
-
[24]
Sampro3d: Locating sam prompts in 3d for zero-shot instance segmentation
Mutian Xu, Xingyilang Yin, Lingteng Qiu, Yang Liu, Xin Tong, and Xiaoguang Han. Sampro3d: Locating sam prompts in 3d for zero-shot instance segmentation. In Proc. of the Intl. Conf. on 3D Vision (3DV), 2025. 3
2025
-
[25]
Hierarchical point-based active learning for 9 semi-supervised point cloud semantic segmentation
Zongyi Xu, Bo Yuan, Shanshan Zhao, Qianni Zhang, and Xinbo Gao. Hierarchical point-based active learning for 9 semi-supervised point cloud semantic segmentation. In Proc. of the IEEE/CVF Intl. Conf. on Computer Vision (ICCV), 2023. 3
2023
-
[26]
Open-vocabulary 3d semantic segmenta- tion with text-to-image diffusion models
Xiaoyu Zhu, Hao Zhou, Pengfei Xing, Long Zhao, Hao Xu, Junwei Liang, Alexander Hauptmann, Ting Liu, and An- drew Gallagher. Open-vocabulary 3d semantic segmenta- tion with text-to-image diffusion models. InProc. of the Europ. Conf. on Computer Vision (ECCV), 2024. 1, 3 10 Appendix A. Real-Robot Semantic Goal Grounding Case Study We provide additional deta...
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.