Volume 22, Issue 1 (5-2025) | JSDP 2025, 22(1): 113-141
Haji-Esmaeili M M, Montazer G. A Critical Survey on Content-Based & Semantic Image Retrieval – Abstract. JSDP 2025; 22(1): 113-141
URL: http://jsdp.rcisp.ac.ir/article-1-1432-en.html
Professor, Faculty of Industrial and Systems Engineering, Tarbiat Modares University, Tehran, Iran
Abstract:
The rapid growth in the volume, diversity, and complexity of visual content in the digital world has made the need for designing and implementing visual content search and retrieval systems highly evident. The web now holds visual data at a scale whose diversity and sheer volume conventional approaches based on manual, human-generated metadata can no longer handle. Without accurate and fast solutions for understanding and retrieving it, this enormous volume of data will disappear into digital archives, never to be found again. In recent years there have been significant efforts to retrieve these images, particularly in the fields of Content-Based Image Retrieval (CBIR) and Semantic Image Retrieval (SIR). Content-based and semantic image retrieval systems can search and retrieve images based on their internal content and high-level, human-understandable semantics, rather than only the metadata that may accompany them.
This paper provides a comprehensive review of the latest advances in content-based image retrieval in recent years. It critically discusses the strengths and weaknesses of each research area in content-based retrieval, and presents an overall framework of the retrieval process and the progress made in areas such as image preprocessing, feature extraction and embedding, machine learning, benchmark datasets, similarity matching, and performance evaluation. Finally, the paper presents novel research approaches, challenges, and suggestions for better advancing research in this field.
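To make the framework stages named above concrete (feature extraction and embedding, similarity matching, ranking), the following is a minimal Python sketch of a content-based retrieval loop. It is not the paper's implementation: it assumes a pretrained ResNet-50 from torchvision as the feature extractor, cosine similarity as the matching function, and the function names (embed, search) are illustrative.

```python
# Minimal CBIR sketch: embed images with a pretrained CNN, rank by cosine similarity.
# Assumes torch/torchvision and Pillow are installed; names are illustrative.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Pretrained backbone with the classification head removed (global pooled features).
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(paths):
    """Return L2-normalized feature vectors for a list of image paths."""
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])
    feats = backbone(batch)                              # shape (N, 2048)
    return torch.nn.functional.normalize(feats, dim=1)

def search(query_path, gallery_paths, top_k=5):
    """Rank gallery images by cosine similarity to the query image."""
    q = embed([query_path])                              # (1, 2048)
    g = embed(gallery_paths)                             # (N, 2048)
    scores = (q @ g.T).squeeze(0)                        # cosine similarity of unit vectors
    order = scores.argsort(descending=True)[:top_k]
    return [(gallery_paths[i], float(scores[i])) for i in order]
```

In a real system the gallery embeddings would be computed once, stored in an index, and only the query would be embedded at search time.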
The paper is organized as follows. After the introduction, Section 2 describes the components of a CBIR system framework and, after a brief look at classical and traditional methods, delves into the workings of modern approaches and their associated challenges. Section 3 provides an overview of the concept of "relevance feedback", explains why this mechanism is needed to enhance retrieval performance in CBIR systems, and introduces the prominent solutions in this domain. Finally, Section 4 reviews the image datasets commonly used in content-based image retrieval, along with a discussion of their characteristics.
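As an illustration of the relevance feedback idea covered in Section 3, the sketch below implements a classical Rocchio-style query update, in which the query vector is moved toward features of images the user marked relevant and away from non-relevant ones. This is one common formulation, not necessarily any of the solutions surveyed in the paper, and the weighting constants are illustrative.

```python
# Rocchio-style relevance feedback sketch (classical formulation; weights illustrative).
import numpy as np

def rocchio_update(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """query: (d,) feature vector; relevant / non_relevant: (k, d) arrays of feedback examples."""
    new_q = alpha * query
    if len(relevant):
        new_q += beta * relevant.mean(axis=0)       # pull toward relevant examples
    if len(non_relevant):
        new_q -= gamma * non_relevant.mean(axis=0)  # push away from non-relevant examples
    # Re-normalize so cosine ranking in the next feedback round stays comparable.
    return new_q / (np.linalg.norm(new_q) + 1e-12)
```

The updated query vector is then re-scored against the gallery, and the loop repeats for as many feedback rounds as the user provides.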
Given the recent advances in computer vision and image processing, especially around the "image-text relationship" and how to integrate the two modalities to improve retrieval performance, a large part of this study focuses on the solutions in this area and on the performance of the prominent methods.
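As a concrete example of the image-text integration this study focuses on, the sketch below ranks images against a free-text query in a CLIP-style joint embedding space. It assumes the publicly available Hugging Face checkpoint "openai/clip-vit-base-patch32" and is an illustrative baseline, not any specific method reviewed in the paper.

```python
# Text-to-image retrieval sketch with a CLIP-style joint embedding.
# Assumes the transformers library and the openai/clip-vit-base-patch32 checkpoint.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def retrieve(text_query, image_paths, top_k=3):
    """Rank images by cosine similarity between the text and image embeddings."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(text=[text_query], images=images, return_tensors="pt", padding=True)
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)   # unit-normalize image vectors
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)   # unit-normalize text vector
    scores = (txt_emb @ img_emb.T).squeeze(0)                # cosine similarity per image
    order = scores.argsort(descending=True)[:top_k]
    return [(image_paths[i], float(scores[i])) for i in order]
```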
Mainstream research in this field is currently dominated by large companies and organizations with access to vast financial resources, which has slowed academic research. With access to enormous data and compute budgets, these companies have trained well-known (and sometimes undisclosed) models at very large scale (billions of images and texts) and, once training is complete, have deployed the final models in various web services without publishing the details of the underlying research. The key point is that the scaling law applies in this field: whoever has more computational and storage resources can train better and more accurate models. This has made it harder for small research units and universities to enter the field, leaving them to wait for publications from the aforementioned organizations and companies. There is a dire need for effective solutions that require limited resources yet achieve accuracy competitive with the massive models, at a fraction of the budget required to train them. This has already happened with large language models: within two years, multiple research groups matched the accuracy of OpenAI's GPT-4 (ChatGPT) language model with models that can run on home devices. Research in image retrieval likewise needs to shift from pursuing accuracy through greater scale to pursuing accuracy at lower cost; otherwise, the field will remain a monopoly of companies focused on greater profits.
Full-Text [PDF 2405 kb]
Type of Study: Research | Subject: Paper
Received: 2024/07/13 | Accepted: 2025/03/15 | Published: 2025/06/21 | ePublished: 2025/06/21


Rights and permissions
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
