arxiv: 2607.00978 · v1 · pith:23CHMX4Onew · submitted 2026-07-01 · 💻 cs.CV · cs.RO

Privacy-Preserving Depth-Only Open-Vocabulary 3D Semantic Segmentation Via Uncertainty-Guided Test-Time Optimization

Pith reviewed 2026-07-02 13:36 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords privacy-preserving · depth-only · open-vocabulary · 3D semantic segmentation · test-time optimization · uncertainty guidance · ScanNet

The pith

An uncertainty-guided test-time optimization method improves depth-only open-vocabulary 3D semantic segmentation without additional training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces UTTO, a framework that turns uncertainty estimates into a guidance signal to spot unreliable semantic predictions from depth data alone and then refines them with semantic priors drawn from foundation models. This setup matters because it enables open-vocabulary 3D segmentation in privacy-sensitive indoor environments where RGB images cannot be captured or used. The approach requires no extra training or labeled target data and operates only at test time. Experiments across ScanNet20, ScanNet40, and ScanNet200 show consistent gains over representative baselines when only depth input is available.

Core claim

UTTO converts uncertainty into a guidance signal to identify unreliable semantic responses and uses semantic priors from foundation models to regularize their refinement, thereby improving depth-only open-vocabulary 3D semantic segmentation without additional training.

What carries the argument

The UTTO framework, an uncertainty-guided test-time optimization process that flags unreliable responses via uncertainty and regularizes refinement using foundation model semantic priors.
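
The loop described above can be illustrated with a minimal sketch. The source text does not define the uncertainty measure or the loss, so everything here is an assumption made for illustration: uncertainty is taken to be normalized predictive entropy, the prior is a per-point class distribution `prior_probs` supplied by some foundation model, and the refinement is plain gradient descent on an uncertainty-weighted cross-entropy. The names `refine_logits`, `threshold`, `lr`, and `steps` are hypothetical and do not come from the paper.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over the class axis.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def normalized_entropy(probs, eps=1e-12):
    # Per-point predictive entropy, scaled to [0, 1] by log(num_classes).
    h = -(probs * np.log(probs + eps)).sum(axis=-1)
    return h / np.log(probs.shape[-1])

def refine_logits(logits, prior_probs, steps=50, lr=0.5, threshold=0.5):
    """Pull uncertain points toward a prior; leave confident points unchanged.

    Assumed objective (not taken from the paper): per-point cross-entropy
    between prior_probs and softmax(logits), weighted by a fixed uncertainty
    mask computed once from the initial predictions.
    """
    u = normalized_entropy(softmax(logits))
    # Points at or below the threshold get weight 0 and are never updated.
    weight = np.where(u > threshold, u, 0.0)[:, None]
    refined = logits.astype(float).copy()
    for _ in range(steps):
        p = softmax(refined)
        # d/dz CE(prior, softmax(z)) = softmax(z) - prior
        refined -= lr * weight * (p - prior_probs)
    return refined, u

# Toy input: one confident point and one ambiguous point, three classes.
logits = np.array([[4.0, 0.0, 0.0],
                   [0.1, 0.0, 0.0]])
prior = np.array([[0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
refined, u = refine_logits(logits, prior)
```

In this toy input the first point keeps its original prediction even though the prior disagrees with it, because its entropy falls below the threshold; only the ambiguous second point moves toward the prior. Whether the paper gates updates in this hard-thresholded way is not stated in the text above.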

If this is right

  • Depth-only inputs can support open-vocabulary 3D segmentation when uncertainty directs the refinement steps.
  • Privacy is maintained by excluding RGB data while segmentation quality still rises.
  • Foundation model priors can stand in for missing visual cues during test-time adaptation.
  • The gains appear across different label sets including 20, 40, and 200 classes on ScanNet.
  • No retraining or new annotations are needed to obtain the reported improvements.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same uncertainty-to-guidance conversion could be tested on other depth-based tasks such as 3D instance segmentation under privacy constraints.
  • The method hints that test-time regularization may lessen dependence on large labeled 3D datasets for new scenes.
  • Similar guidance signals might help in related settings where one modality is missing, such as LiDAR-only perception.
  • Evaluating the approach on outdoor depth data would check whether the foundation-model regularization generalizes beyond the reported indoor benchmarks.

Load-bearing premise

Uncertainty estimates can be turned into a reliable signal for identifying which semantic responses need fixing, and foundation-model priors can correct those responses effectively even though depth data supplies no appearance cues.

What would settle it

Applying UTTO to the ScanNet datasets and measuring no gain or a drop in segmentation accuracy relative to the unrefined depth-only baseline would show the claimed improvement does not hold.
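
The comparison described above reduces to a difference in mean intersection-over-union (mIoU) between refined and unrefined predictions. The following is a minimal sketch of that measurement, assuming per-point integer labels and a hypothetical `ignore_label` of -1 for unannotated points; the actual ScanNet evaluation protocol and label conventions are not given in the text above and may differ.

```python
import numpy as np

def mean_iou(pred, gt, num_classes, ignore_label=-1):
    # Drop points without a ground-truth annotation (ignore_label is assumed).
    valid = gt != ignore_label
    pred, gt = pred[valid], gt[valid]
    ious = []
    for c in range(num_classes):
        union = np.logical_or(pred == c, gt == c).sum()
        if union == 0:
            continue  # class absent from both prediction and ground truth
        inter = np.logical_and(pred == c, gt == c).sum()
        ious.append(inter / union)
    return float(np.mean(ious)) if ious else float("nan")

def miou_gain(baseline_pred, refined_pred, gt, num_classes):
    # Positive: refinement helped. Zero or negative: the claim fails here.
    return (mean_iou(refined_pred, gt, num_classes)
            - mean_iou(baseline_pred, gt, num_classes))
```

A gain that is zero or negative across scenes and label sets is the outcome that would count against the claim.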

Figures

Figures reproduced from arXiv: 2607.00978 by Maren Bennewitz, Sicong Pan, Xuying Huang.

Figure 1: An example application scenario for privacy-preserving depth-only open-vocabulary 3D perception.
Figure 2: Overview of uncertainty-guided test-time optimization. Given a depth-only geometry input, label-preserving test-time aug...
Figure 3: Qualitative comparison under the privacy-preserving depth-only setting. All methods use the same semantic color palette, with...
Figure 4: Real-world indoor scene used in the semantic goal grounding case study. This colorized top-down visualization is shown only...
Figure 5: Qualitative visualization of real-robot semantic goal grounding results. For each class-level query, red regions indicate the...
Original abstract

Privacy-preserving perception is a critical requirement for deploying 3D scene understanding systems in real-world indoor environments, yet it remains underexplored in open-vocabulary 3D semantic segmentation. Existing methods typically rely on obtaining rich semantic cues from RGB images, which may expose privacy-sensitive visual information. Depth-only 3D geometry provides a privacy-preserving alternative, but the absence of appearance-based semantic cues makes open-vocabulary predictions highly uncertain and less reliable. Under this setting, we propose to convert uncertainty into a guidance signal to identify unreliable semantic responses and use semantic priors from foundation models to regularize their refinement. We present UTTO, an uncertainty-guided test-time optimization framework for depth-only open-vocabulary 3D semantic segmentation. Without additional training, experiments on ScanNet20, ScanNet40, and ScanNet200 demonstrate that UTTO consistently improves depth-only open-vocabulary 3D segmentation and outperforms representative baselines under privacy-preserving conditions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes UTTO, an uncertainty-guided test-time optimization framework for depth-only open-vocabulary 3D semantic segmentation. It converts uncertainty signals from depth geometry into guidance to identify unreliable predictions and uses semantic priors from foundation models to regularize refinement during test-time optimization, without any additional training. Experiments on ScanNet20, ScanNet40, and ScanNet200 are claimed to show consistent improvements over representative baselines under privacy-preserving conditions.

Significance. If the central claims hold with verifiable quantitative support, the work would address an important gap in privacy-preserving 3D perception by enabling open-vocabulary segmentation from depth alone. This could have practical value for indoor scene understanding where RGB data raises privacy concerns. The use of test-time optimization with external priors is a plausible direction, but its effectiveness depends on the unverified assumptions about uncertainty guidance and regularization.

major comments (3)
  1. [Abstract] Abstract: The abstract states that UTTO consistently improves performance and outperforms baselines but provides no quantitative metrics, error bars, method details, or ablation studies, making it impossible to verify if the data supports the claim.
  2. [Method] Method (or §3): No definition is given for the uncertainty measure derived solely from depth geometry (e.g., whether it is entropy, variance, or another quantity), nor for how this signal is converted into an optimization guidance term or how the foundation-model regularization loss is constructed and balanced.
  3. [Experiments] Experiments: The central claim of consistent improvement on ScanNet20/40/200 requires evidence that depth-only uncertainty correctly flags semantic errors (rather than geometric ambiguity) and that the priors improve rather than hallucinate; without reported numbers, ablations, or failure cases, this load-bearing assumption remains untested.
minor comments (2)
  1. [Method] Notation for the uncertainty-guided loss and optimization objective should be introduced with explicit equations early in the method section for clarity.
  2. [Introduction] The paper should include a clear statement of the privacy threat model and confirm that no RGB or appearance data is used at any stage, including during prior extraction.
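
To make major comment 2 concrete, the two candidate quantities it names, and a generic form of the objective it asks for, could be written as follows. All notation here is illustrative and assumed rather than taken from the paper: $p_{ik}$ is the predicted probability of class $k$ at point $i$, $p_i^{(t)}$ is the prediction under the $t$-th of $T$ test-time augmentations, $q_i$ is a prior distribution from a foundation model, $w(\cdot)$ is an unspecified weighting function, and $\lambda$ is a balancing coefficient.

```latex
% Candidate uncertainty measures (either could play the role of u_i)
H_i = -\sum_{k=1}^{K} p_{ik} \log p_{ik},
\qquad
V_i = \frac{1}{T} \sum_{t=1}^{T} \bigl\lVert p_i^{(t)} - \bar{p}_i \bigr\rVert_2^2,
\quad
\bar{p}_i = \frac{1}{T} \sum_{t=1}^{T} p_i^{(t)}

% Generic uncertainty-weighted test-time objective with a prior term
\mathcal{L}(\theta)
  = \sum_{i} w(u_i)\, \mathrm{CE}\bigl(q_i,\; p_i(\theta)\bigr)
  + \lambda\, \mathcal{R}(\theta),
\qquad u_i \in \{H_i, V_i\}
```

Which of these, if either, matches the method is exactly what the comment asks the authors to state.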

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to enhance clarity, provide additional quantitative support, and strengthen the experimental evidence where possible.

  1. Referee: [Abstract] Abstract: The abstract states that UTTO consistently improves performance and outperforms baselines but provides no quantitative metrics, error bars, method details, or ablation studies, making it impossible to verify if the data supports the claim.

    Authors: We agree that the abstract would benefit from quantitative support to substantiate the claims. In the revision, we will incorporate key performance metrics (e.g., mIoU improvements on the ScanNet benchmarks) along with standard deviations from repeated runs.
    Revision: yes

  2. Referee: [Method] Method (or §3): No definition is given for the uncertainty measure derived solely from depth geometry (e.g., whether it is entropy, variance, or another quantity), nor for how this signal is converted into an optimization guidance term or how the foundation-model regularization loss is constructed and balanced.

    Authors: We will revise Section 3 to include an explicit definition of the uncertainty measure derived from depth geometry, along with the mathematical formulation of the guidance term and the construction/balancing of the foundation-model regularization loss, to improve reproducibility.
    Revision: yes

  3. Referee: [Experiments] Experiments: The central claim of consistent improvement on ScanNet20/40/200 requires evidence that depth-only uncertainty correctly flags semantic errors (rather than geometric ambiguity) and that the priors improve rather than hallucinate; without reported numbers, ablations, or failure cases, this load-bearing assumption remains untested.

    Authors: The manuscript already reports quantitative results and comparisons on ScanNet20/40/200 in the experiments section. To further validate the assumptions, we will add ablations and analysis in the revision showing the relationship between uncertainty signals and semantic errors, as well as discussion of cases involving potential hallucination by the priors.
    Revision: partial
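
One way the promised analysis of uncertainty versus semantic error could be carried out is to treat per-point uncertainty as a score for detecting misclassified points and report its AUROC. The sketch below is an assumption about how such a check might look, not a description of the authors' planned ablation; the function name `error_detection_auroc` is hypothetical. It uses the rank-based (Mann-Whitney U) form of AUROC with average ranks for ties.

```python
import numpy as np

def error_detection_auroc(uncertainty, pred, gt):
    """AUROC of uncertainty as a detector of misclassified points.

    1.0: every wrong point is more uncertain than every correct point.
    0.5: uncertainty carries no information about correctness.
    """
    uncertainty = np.asarray(uncertainty, dtype=float).ravel()
    errors = np.asarray(pred).ravel() != np.asarray(gt).ravel()
    n_pos, n_neg = int(errors.sum()), int((~errors).sum())
    if n_pos == 0 or n_neg == 0:
        return float("nan")  # undefined without both outcomes present
    # Average 1-based rank of each distinct uncertainty value (handles ties).
    _, inverse, counts = np.unique(
        uncertainty, return_inverse=True, return_counts=True)
    ends = np.cumsum(counts)
    avg_rank = ends - (counts - 1) / 2.0
    ranks = avg_rank[inverse]
    # Mann-Whitney U statistic for the error class, normalized to [0, 1].
    u_stat = ranks[errors].sum() - n_pos * (n_pos + 1) / 2.0
    return float(u_stat / (n_pos * n_neg))
```

A value near 0.5 on the depth-only predictions would support the referee's concern that the uncertainty tracks geometric ambiguity rather than semantic error.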

Circularity Check

0 steps flagged

No circularity: framework uses external priors and test-time optimization without self-referential reduction

The abstract and description introduce UTTO as an independent test-time optimization method that converts uncertainty into guidance and applies external foundation-model semantic priors for regularization. No equations, fitted parameters renamed as predictions, or self-citation chains are present in the provided text that would reduce any claimed result to the inputs by construction. The approach is presented as using off-the-shelf foundation models and evaluated on standard ScanNet benchmarks, making the derivation self-contained against external components.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No specific free parameters, axioms, or invented entities can be identified from the abstract alone; the method relies on unspecified uncertainty estimation and foundation model priors.

pith-pipeline@v0.9.1-grok · 5705 in / 1150 out tokens · 25093 ms · 2026-07-02T13:36:19.365673+00:00 · methodology

