Unsupervised Semantic Segmentation of RGB-D Images Using Combination of Conditional Random Field with Graph Cuts

Mirkamali, Seyedsaeid

Signal and Data Processing Journal. A scientific journal officially licensed by the Commission for Scientific Publications of the MSRT. Publisher: Research Center for Development of Technologies

Volume 22, Issue 4 (3-2026) JSDP 2026, 22(4): 39-52

Mirkamali S. Unsupervised Semantic Segmentation of RGB-D Images Using Combination of Conditional Random Field with Graph Cuts. JSDP 2026; 22 (4) : 3
URL: http://jsdp.rcisp.ac.ir/article-1-1448-en.html

Seyedsaeid Mirkamali ^*

Assistant Professor Department of Computer Engineering and IT, Payame Noor University, Tehran, Iran

Abstract:

Semantic segmentation assigns a label to each group of pixels that depicts an object in an image, based on the pixels' appearance and semantic characteristics. Although the task has attracted considerable interest in recent years, it remains one of the most difficult problems in image processing and computer vision.
The availability of RGBD sensors has introduced new possibilities for segmentation by incorporating depth information alongside color. However, effectively combining these modalities presents challenges due to misalignments and depth inaccuracies. This paper proposes CRFCut, a novel unsupervised segmentation method that utilizes a Conditional Random Field (CRF) model optimized with graph cuts to segment RGBD images into coherent regions. The method recursively divides regions into foreground and background layers, employing superpixel-based appearance segmentation for the RGB component and integrating depth cues to refine results. This approach enables robust segmentation, even in the presence of noisy or incomplete depth information.
The CRFCut algorithm begins by separating the depth image into foreground and background regions using a median depth threshold. This initial step requires no preprocessing and provides the basis for further segmentation. Simultaneously, the RGB image is segmented into superpixels using an appearance-based approach, such as the mean-shift algorithm. These superpixels and the depth regions are combined within a CRF model, where labels are assigned by minimizing the energy function using the graph-cut α-expansion algorithm. The algorithm is applied recursively to subdivided regions, allowing finer segmentation in a parallelizable manner.
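The recursive foreground/background split described above can be illustrated with a minimal Python sketch. It covers only the median depth threshold and the recursion. The superpixel step and the CRF energy minimization with α-expansion are deliberately omitted, so this is not the CRFCut algorithm itself. The function names and the stopping rules `max_levels` and `min_pixels` are assumptions introduced for illustration and do not come from the paper.

```python
import numpy as np


def median_depth_split(depth, mask):
    """Split the pixels selected by `mask` at their median depth.

    Pixels nearer than the median form the foreground; the rest form
    the background.
    """
    threshold = np.median(depth[mask])
    foreground = mask & (depth < threshold)
    background = mask & (depth >= threshold)
    return foreground, background


def recursive_split(depth, max_levels=2, min_pixels=4):
    """Recursively bisect a depth map and return an integer label map.

    `max_levels` and `min_pixels` are hypothetical stopping rules.
    The CRF / graph-cut refinement is not performed here.
    """
    labels = np.zeros(depth.shape, dtype=int)
    next_label = 0
    # Each stack entry is (region mask, recursion level).
    stack = [(np.ones(depth.shape, dtype=bool), 0)]
    while stack:
        region, level = stack.pop()
        fg, bg = median_depth_split(depth, region)
        too_deep = level >= max_levels
        too_small = fg.sum() < min_pixels or bg.sum() < min_pixels
        if too_deep or too_small:
            # Keep the region whole and give it its own label.
            labels[region] = next_label
            next_label += 1
            continue
        # Push background first so the foreground is processed first.
        stack.append((bg, level + 1))
        stack.append((fg, level + 1))
    return labels


# Toy depth map with three depth layers (values are arbitrary units).
depth = np.array([[1, 1, 5, 5],
                  [1, 1, 5, 5],
                  [1, 1, 9, 9],
                  [1, 1, 9, 9]], dtype=float)
labels = recursive_split(depth)
```

Because each region is split independently of its siblings, the two halves produced at any level could be processed in parallel, which is consistent with the parallelizable structure mentioned above.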
The proposed method was evaluated on two datasets: NYUv2 and the MIT dataset. On NYUv2, which includes 1449 RGBD images with annotated object classes, CRFCut outperformed five state-of-the-art segmentation techniques (Table 1). On the MIT dataset, which provides human-labeled sequences of indoor and outdoor scenes, CRFCut achieved comparable or better results, even with depth maps generated from 2D images by existing estimation methods (Table 2). Segmentation accuracy was measured with the Rand index, and the qualitative results in Figures 3 and 4 highlight CRFCut's robustness, particularly with noisy or imprecise depth data.
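For readers unfamiliar with the evaluation metric, the Rand index is the fraction of pixel pairs on which two segmentations agree, meaning the pair is grouped together in both or separated in both. The following Python sketch computes it from a contingency table. The function name `rand_index` is chosen here for illustration, and the abstract does not state whether the paper uses this plain form or a variant.

```python
import numpy as np


def rand_index(labels_a, labels_b):
    """Rand index between two labelings of the same pixels (0 to 1)."""
    a = np.asarray(labels_a).ravel()
    b = np.asarray(labels_b).ravel()
    if a.size != b.size:
        raise ValueError("labelings must cover the same number of pixels")
    n = a.size
    if n < 2:
        raise ValueError("at least two pixels are required")

    # Map arbitrary label values to 0..k-1 and build a contingency table.
    _, a_idx = np.unique(a, return_inverse=True)
    _, b_idx = np.unique(b, return_inverse=True)
    table = np.zeros((a_idx.max() + 1, b_idx.max() + 1), dtype=np.int64)
    np.add.at(table, (a_idx, b_idx), 1)

    def pairs(x):
        # Number of unordered pairs that can be drawn from x items.
        return x * (x - 1) // 2

    same_in_both = pairs(table).sum()
    same_in_a = pairs(table.sum(axis=1)).sum()
    same_in_b = pairs(table.sum(axis=0)).sum()
    total = pairs(n)

    # Agreements = pairs joined in both + pairs separated in both.
    agreements = total + 2 * same_in_both - same_in_a - same_in_b
    return float(agreements) / float(total)
```

The score does not depend on how the regions are numbered, so a segmentation that matches the ground truth up to a relabeling of regions scores 1.0.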
In summary, CRFCut introduces an unsupervised CRF-based approach that integrates RGB and depth information for accurate scene segmentation. By leveraging graph-cut optimization and a recursive structure, the method achieves high-quality segmentation results with minimal preprocessing. Despite some limitations, such as challenges in distinguishing adjacent objects with similar features, CRFCut offers a promising framework for real-time segmentation of RGBD images. Future work will address these limitations by incorporating supervised techniques and improving depth data quality for enhanced performance.

Article number: 3

Keywords: Semantic Segmentation, RGB-D Image, Combination of Conditional Random Field, Graph Cuts

Type of Study: Research | Subject: Paper
Received: 2024/11/27 | Accepted: 2025/07/21 | Published: 2026/03/20 | ePublished: 2026/03/20

Rights and permissions
	This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.