A survey on short text similarity measurement methods

Rabiei Zadeh, Ahmad; Amirkhani, Hossein

doi:10.61186/jsdp.20.3.103

Volume 20, Issue 3 (12-2023) JSDP 2023, 20(3): 103-126

Rabiei Zadeh A, Amirkhani H. A survey on short text similarity measurement methods. JSDP 2023; 20 (3) : 8
URL: http://jsdp.rcisp.ac.ir/article-1-1307-en.html

Ahmad Rabiei Zadeh*, Hossein Amirkhani

AI Laboratory of Computer Research Center of Islamic Science (Noor)

Abstract:

Measuring the similarity between two text snippets is an essential task in many NLP problems, and it remains one of the most challenging tasks in the field. Various methods have been proposed to measure text similarity. This survey reviews more than 150 related papers, introduces a comprehensive taxonomy with three main categories, and discusses the advantages and disadvantages of the methods in each. The first category is lexical methods, which consider only the surface similarity of a text pair. These methods treat the text as a sequence of characters, tokens, or a mixture of the two. Some recent studies use deep learning techniques to detect lexical similarity in the alias detection task. The second category is semantic methods, which take into account the meaning of words, either through pre-built knowledge bases such as WordNet or through corpus-based methods. Some recent studies use modern deep learning techniques such as transformers and Siamese networks to create document embeddings that outperform other methods. The final category is hybrid methods, which combine the other methods and, in some cases, syntactic parsing. Note that high-quality syntactic parsers are not available for many languages, and that using them has side effects on performance and speed.
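
As a rough illustration of the lexical category described in the abstract, the sketch below compares two short texts first as character sequences and then as token sets. It is not taken from the survey: the function names (`levenshtein`, `char_similarity`, `token_jaccard`), the normalization by the longer string's length, and the whitespace tokenization are all assumptions chosen for brevity.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance; insert, delete, substitute each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def char_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1]; this normalization is an assumption."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def token_jaccard(a: str, b: str) -> float:
    """Jaccard overlap of lowercase, whitespace-separated token sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(token_jaccard("the cat sat", "the cat ran"))
```

Both measures look only at surface form, so a paraphrase that shares no words with the original scores near zero. That limitation is what the semantic and hybrid categories are meant to address.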

Article number: 8

Keywords: short text similarity, lexical similarity, semantic similarity, natural language processing, sentence embedding, transformer

Type of Study: Research | Subject: Paper
Received: 2022/04/20 | Accepted: 2023/02/22 | Published: 2024/01/14 | ePublished: 2024/01/14

Rights and permissions
	This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Signal and Data Processing
