Volume 18, Issue 1 (2021), pages 119-134



Soltanian M, Ghaemmaghami S. Recognition of Visual Events using Spatio-Temporal Information of the Video Signal. JSDP 2021; 18 (1): 119-134.
URL: http://jsdp.rcisp.ac.ir/article-1-928-fa.html


Sharif University of Technology
Abstract:   (2147 views)
In this paper, recognition of visual events in video is studied analytically, exploiting the temporal information of the signal. Using transfer learning, descriptors trained on still images are applied to video, making event recognition feasible with limited computational resources. A convolutional neural network serves as an extractor of concept scores from video frames: its parameters are first fine-tuned on a subset of the training data, and the outputs of its fully connected layers are then used as frame-level descriptors. The resulting descriptors are encoded, normalized, and finally classified. The main novelty of this paper is the incorporation of the video's temporal information into the encoding of its descriptors. The structured inclusion of visual information in the encoding of video descriptors is often overlooked, which reduces accuracy. To address this, a novel encoding method is presented that improves the trade-off between computational complexity and accuracy in video event recognition. In this encoding, the temporal dimension of the video signal is used to build a spatio-temporal vector of locally aggregated descriptors (VLAD); it is then shown that the proposed encoding is essentially an optimization problem that can readily be solved with existing algorithms. Compared with the best existing frame-level-descriptor methods for visual event recognition, the proposed method provides a better model of the video, achieving higher performance on both test datasets in terms of three measures: mean average precision, mean average recall, and the F measure. The results confirm the ability of the proposed method to improve the performance of visual event recognition systems.
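The pipeline described above (frame-level descriptors, VLAD-style encoding with a temporal dimension, normalization) can be illustrated with a minimal sketch. This is not the paper's actual formulation, which is posed as an optimization problem; it is a simplified assumption: frame descriptors are assigned to nearest codebook centers, residuals are aggregated per temporal segment, and the per-segment VLAD vectors are concatenated so temporal order is retained. All names and the segmentation scheme are illustrative.

```python
import numpy as np

def vlad(descriptors, centers):
    """Standard VLAD: accumulate residuals of descriptors to their
    nearest codebook center, then power- and L2-normalize."""
    K, D = centers.shape
    # nearest-center assignment: (T, K) squared distances
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)
    v = np.zeros((K, D))
    for k in range(K):
        members = descriptors[assign == k]
        if len(members):
            v[k] = (members - centers[k]).sum(axis=0)
    v = v.ravel()
    v = np.sign(v) * np.sqrt(np.abs(v))   # power (signed square-root) normalization
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def temporal_vlad(frame_descriptors, centers, n_segments=4):
    """Split the frame sequence into temporal segments and concatenate
    one VLAD vector per segment, preserving temporal order."""
    segments = np.array_split(frame_descriptors, n_segments)
    return np.concatenate([vlad(seg, centers) for seg in segments])
```

For a video of 60 frames with 8-dimensional descriptors and a 3-center codebook, `temporal_vlad(..., n_segments=4)` yields a vector of length 4 × 3 × 8 = 96; plain VLAD would collapse the same video to 24 dimensions and discard frame order.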
Full text [PDF 1403 kb]   (1078 downloads)
Study type: Fundamental | Subject: Image processing
Received: 1397/8/21 | Accepted: 1397/11/30 | Published: 1400/3/1 | ePublished: 1400/3/1




Reuse information
This article may be republished under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License.

All rights of this website belong to the scientific-research journal Signal and Data Processing (JSDP).